The package will contain a number of modules to define a neural network potential, to interact with the training dataset, to run the training itself, and to ship the trained neural network potential. Training routines need to be flexible yet reproducible; hyper-parameters need to be stored with the training outcome. Models and training routines will all be implemented using PyTorch.

We will base our work on the following packages for specific tasks (this field appears to prefer pip, so we should make sure the packages we choose are available on both conda-forge and pip).

| Task | Package |
| --- | --- |
| Model definition | pytorch |
| Model training | pytorch-lightning |
| Interaction with QM datasets, spawning QM calculations | QCArchive, QCSubmit |
| ML experiments and experiment version control | data version control (dvc) |
| Units | openff-units |
| Visualize training | tensorboard |
| Distributed training, hyper-parameter selection and optimization | ray |

A note on units: we will use openff-units throughout the project at the entry points, but internally we will perform calculations on plain values within a defined unit system. Note that qcelemental (which we will include as a dependency for integration with QCArchive) works with pint, and it appears all relevant unit conversions have already been defined.
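A minimal sketch of this pattern (the helper name and the choice of internal units are illustrative, not part of the design): quantities carry openff-units at the entry point and are reduced to plain floats in the internal unit system before any computation.

```python
from openff.units import unit

# Illustrative internal unit system (openMM-style: nanometer, kilojoule/mole).
_DISTANCE_UNIT = unit.nanometer
_ENERGY_UNIT = unit.kilojoule_per_mole


def strip_units(distance, energy):
    """Convert openff-units (pint) quantities to plain floats in the internal unit system."""
    # openff-units wraps pint, so .m_as() returns the magnitude in the requested unit
    return distance.m_as(_DISTANCE_UNIT), energy.m_as(_ENERGY_UNIT)


# At an entry point, callers pass explicitly united quantities ...
d, e = strip_units(5.2 * unit.angstrom, 1.0 * unit.kilocalorie_per_mole)
# ... and downstream code works with bare floats (0.52 nm, 4.184 kJ/mol).
```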

Package outline

Overview

modelforge
	datasets/ # retrieve and process datasets to enable efficient training  
	curation/ # curate source QM data into standardized hdf5 files (naming, units)
	interface/ # interface to other molecular simulation packages
	potential/ # defines all operations needed to implement an NNP
	train/ # define hyper-parameters and their optimization, training routines
	translate/ # tools for translating models from pytorch into different frameworks
	utils/

Dataset module

The dataset module provides functions and classes to retrieve, transform, and store QM datasets from QCArchive and to provide them as PyTorch Datasets or Lightning data modules for training NNPs.

The dataset module implements the data curation, the actions associated with data storage, caching, and retrieval, as well as the pipeline from the stored hdf5 files to the PyTorch dataset class used for training.

The general workflow to interact with public datasets will be:

  1. obtaining the dataset
  2. processing the dataset and storing it in an hdf5 file with standard naming and units (see the sketch after this list)
  3. uploading to zenodo and updating the retrieval link in the dataset implementation
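A sketch of what step 2 could look like, assuming h5py for storage; the key names, shapes, and unit strings below are hypothetical placeholders for the standard naming scheme:

```python
import h5py
import numpy as np

# Hypothetical key names, shapes, and units; the actual standard naming scheme is
# fixed during curation and documented in the per-dataset README.md on Zenodo.
records = {
    #  key                  (data,                               unit string)
    "atomic_numbers":      (np.ones((10, 5), dtype=np.int64),   "dimensionless"),
    "geometry":            (np.random.rand(10, 5, 3),           "nanometer"),
    "internal_energy_0K":  (np.random.rand(10),                 "kilojoule_per_mole"),
}

with h5py.File("qm9_processed.hdf5", "w") as f:
    for key, (data, unit_string) in records.items():
        dset = f.create_dataset(key, data=data, compression="gzip")
        # keep the unit with the data so labels/units can later be exported to the README.md
        dset.attrs["units"] = unit_string
```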

The specific dataset classes like QM9Dataset or SPICEDataset download an hdf5 file with defined key names and values in a specific format from Zenodo and load the data into memory. The values in the dataset need to be specified in the openMM unit system. For each uploaded Zenodo dataset (in hdf5 format) we will generate a README.md that contains all labels and their respective units.
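On the consumer side, a dataset class would pull the file from Zenodo and load it into memory; a minimal reading sketch (file name and keys again hypothetical), which also shows where the README.md label/unit listing could come from:

```python
import h5py

# Hypothetical file name; in practice the dataset class downloads it from Zenodo first.
with h5py.File("qm9_processed.hdf5", "r") as f:
    data = {key: f[key][()] for key in f.keys()}             # load all arrays into memory
    units = {key: f[key].attrs["units"] for key in f.keys()}

# The per-dataset README.md can be generated from the same unit attributes.
readme_lines = [f"- {key}: {unit_string}" for key, unit_string in units.items()]
```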

The public API for creating a TorchDataset is implemented in the specific data classes (e.g. QM9Dataset) and in the DatasetFactory. The TorchDataset can then be wrapped in a PyTorch DataLoader.
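A hypothetical usage sketch of that API (the import path and method names are assumptions, not final):

```python
from torch.utils.data import DataLoader

# Assumed import path and method names, for illustration only.
from modelforge.datasets import QM9Dataset, DatasetFactory

data = QM9Dataset()                                      # knows the Zenodo record, keys, and units
torch_dataset = DatasetFactory().create_dataset(data)    # download/cache, return a TorchDataset

# A standard PyTorch DataLoader handles batching and shuffling during training.
loader = DataLoader(torch_dataset, batch_size=64, shuffle=True)
```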

The utils.py file contains a SplittingStrategy base class and a RandomSplittingStrategy class that takes a TorchDataset as input and returns three views for training, validation, and testing.
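A minimal sketch of how such a splitting strategy could look (fractions, seed, and method name are illustrative), built on torch.utils.data.random_split:

```python
from typing import Tuple

import torch
from torch.utils.data import Dataset, Subset, random_split


class SplittingStrategy:
    """Base class: split a TorchDataset into train/validation/test views."""

    def split(self, dataset: Dataset) -> Tuple[Subset, Subset, Subset]:
        raise NotImplementedError


class RandomSplittingStrategy(SplittingStrategy):
    """Random split with illustrative default fractions and a fixed seed for reproducibility."""

    def __init__(self, fractions=(0.8, 0.1, 0.1), seed: int = 42):
        self.fractions = fractions
        self.seed = seed

    def split(self, dataset: Dataset) -> Tuple[Subset, Subset, Subset]:
        n = len(dataset)
        n_train = int(self.fractions[0] * n)
        n_val = int(self.fractions[1] * n)
        n_test = n - n_train - n_val  # remainder goes to the test set
        generator = torch.Generator().manual_seed(self.seed)
        train, val, test = random_split(dataset, [n_train, n_val, n_test], generator=generator)
        return train, val, test
```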