The package will contain a number of modules defining a neural network potential, the interaction with the training dataset, the training itself, and the shipping of the trained neural network potential. Training routines need to be flexible yet reproducible, and hyper-parameters need to be stored with the training outcome. Models and training routines will all be implemented using PyTorch.
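As a concrete illustration of storing hyper-parameters with the training outcome, a minimal sketch using pytorch-lightning's `save_hyperparameters()` (the module name and its parameters are hypothetical):

```python
import pytorch_lightning as pl
import torch


class PotentialModule(pl.LightningModule):  # hypothetical module name
    def __init__(self, hidden_dim: int = 128, learning_rate: float = 1e-3):
        super().__init__()
        # Record the constructor arguments; lightning writes them into every
        # checkpoint, so the trained model carries its hyper-parameters.
        self.save_hyperparameters()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
```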
We will base our work on the following packages for specific tasks (apparently this field has a preference for pip, so we might want to ensure we are choosing packages available on both conda-forge and pip).
| Task | Package |
|---|---|
| Model definition | pytorch |
| Model training | pytorch-lightning |
| Interaction with QM datasets, spawning QM calculations | QCArchive, QCSubmit |
| ML experiments and experiment version control | dvc (data version control) |
| Units | openff-units |
| Training visualization | tensorboard |
| Distributed training, hyper-parameter selection and optimization | ray |
A note on units: we will use openff-units throughout the project at entry points, but internally we will perform calculations without attached units, within a defined unit system. Note that qcelemental (which we will include as a dependency for integration with qcarchive) works with pint, and it appears all relevant unit conversions have already been defined.
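A minimal sketch of this convention, assuming kJ/mol as the internal energy unit (the helper name is hypothetical):

```python
from openff.units import unit  # pint-based unit registry shipped by openff-units


def to_internal_energy(energy: unit.Quantity) -> float:
    """Convert a unit-carrying energy to the internal unit system (here: kJ/mol)
    and strip the units before the value enters the numerical code."""
    return energy.to(unit.kilojoule / unit.mole).magnitude


e = 27.2 * unit.kilocalorie / unit.mole  # entry point: quantities carry units
print(to_internal_energy(e))             # internally: a bare float in kJ/mol
```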
modelforge
    datasets/    # retrieve and process datasets to enable efficient training
    curation/    #
    interface/   # interface to other molecular simulation packages
    potential/   # defines all operations needed to implement an NNP
    train/       # defines hyper-parameters and their optimization, and the training routine
    translate/   # tools for translating models from PyTorch into different frameworks
    utils/
The dataset module provides functions and classes to retrieve, transform, and store QM datasets from QCArchive, and serves them as `torch.utils.data.Dataset` or `LightningDataModule` objects for training NNPs. The dataset module implements the data curation, the actions associated with data storage, caching, and retrieval, as well as the pipeline from the stored HDF5 files to the PyTorch `Dataset` class that can be used for training.
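A minimal sketch of the HDF5-to-`Dataset` step described above; the key names (`coordinates`, `energies`) and the class name are assumptions:

```python
import h5py
import torch
from torch.utils.data import Dataset


class HDF5TorchDataset(Dataset):  # hypothetical name for the pipeline's output
    def __init__(self, path: str):
        # Load the stored arrays into memory as tensors.
        with h5py.File(path, "r") as f:
            self.coordinates = torch.tensor(f["coordinates"][:], dtype=torch.float32)
            self.energies = torch.tensor(f["energies"][:], dtype=torch.float32)

    def __len__(self) -> int:
        return len(self.energies)

    def __getitem__(self, idx: int):
        return self.coordinates[idx], self.energies[idx]
```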
The general workflow to interact with public datasets will be:

*(figure: dataset implementation workflow)*

The specific dataset classes like `QM9Dataset` or `SPICEDataset` download an HDF5 file with defined key names and values in a specific format from Zenodo and load the data into memory. The values in the dataset need to be specified in the OpenMM unit system. For each uploaded Zenodo dataset (in HDF5 format) we will generate a README.md that contains all labels and their respective units.
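A sketch of the download-and-cache step; the Zenodo URL is a placeholder and the method name is an assumption:

```python
import os
import urllib.request


class QM9Dataset:
    # Placeholder URL; the real record is defined by the dataset release on Zenodo.
    url = "https://zenodo.org/record/<record-id>/files/qm9.hdf5"

    def download(self, cache_dir: str = "~/.cache/modelforge") -> str:
        """Fetch the HDF5 file once and reuse the cached copy afterwards."""
        cache_dir = os.path.expanduser(cache_dir)
        os.makedirs(cache_dir, exist_ok=True)
        target = os.path.join(cache_dir, "qm9.hdf5")
        if not os.path.isfile(target):
            urllib.request.urlretrieve(self.url, target)
        return target
```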
The public API for creating a `TorchDataset` is implemented in the specific data classes (e.g. `QM9Dataset`) and in the `DatasetFactory`. The `TorchDataset` can be loaded into a PyTorch `DataLoader`, as sketched below.
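How the pieces could fit together, reusing the hypothetical classes sketched above (the `DatasetFactory` call signature is an assumption):

```python
from torch.utils.data import DataLoader

factory = DatasetFactory()
torch_dataset = factory.create_dataset(QM9Dataset())  # assumed factory method
loader = DataLoader(torch_dataset, batch_size=64, shuffle=True)

for coordinates, energies in loader:
    ...  # batches are ready to be fed to the NNP during training
```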
The `utils.py` file contains a `SplittingStrategy` base class and a `RandomSplittingStrategy` class that takes a `TorchDataset` as input and returns three views for training, validation, and testing.
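A minimal sketch of what `RandomSplittingStrategy` might look like, built on `torch.utils.data.random_split`; the 80/10/10 fractions and the fixed seed are assumptions:

```python
import torch
from torch.utils.data import Dataset, Subset, random_split


class RandomSplittingStrategy:
    def __init__(self, fractions=(0.8, 0.1, 0.1), seed: int = 42):
        self.fractions = fractions
        self.seed = seed  # a fixed seed keeps the splits reproducible

    def split(self, dataset: Dataset) -> tuple[Subset, Subset, Subset]:
        n = len(dataset)
        n_train = int(self.fractions[0] * n)
        n_val = int(self.fractions[1] * n)
        n_test = n - n_train - n_val  # remainder goes to the test set
        generator = torch.Generator().manual_seed(self.seed)
        train, val, test = random_split(
            dataset, [n_train, n_val, n_test], generator=generator
        )
        return train, val, test
```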