Notes / Discussion points

Issue: It would make sense to use pytorch-lightning for training, ray to distribute training across compute instances, and dvc to control every aspect that should be reproducible, including the training dataset (pytorch-lightning, ray, and dvc can be easily combined). Ideally, we want to define each ML experiment in one or more dvc.yaml files, so that experiments can be easily reproduced and any change to training or data processing is traceable. So far we have worked under the assumption that QCArchive is our primary data API and that we upload an hdf5 file to zenodo for each dataset. An alternative would be a dvc dataset registry. Should we make dvc our primary storage hub for all the data needed for an ML experiment? Instead of relying on a mixture of zenodo and QCArchive, we could use dvc to store and manage models, parameters (and hyperparameters, including everything needed to reproduce the training), and datasets (I propose to use it for model and parameter storage in any case). QCArchive would then only be used in the active training pipeline to generate new data points.

What we decided: We will use QCArchive or other data sources to obtain data, then restructure this data into hdf5 files that we upload to zenodo. These hdf5 files will serve as the main way to load data into the package. Since different datasets may have different properties, different formats, and potentially different labels for the same properties, we will need to write a separate converter for each dataset. However, we should be able to mostly standardize the hdf5 files we generate, so loading into modelforge should be similar for each dataset. Implementing individual classes (i.e., a custom child class of dataset) would resolve any subtle differences between data formats that may occur.
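
To illustrate the per-dataset converter idea, here is a minimal sketch of how source-specific labels could be mapped onto one shared set of property names before writing the hdf5 file. All names here (DatasetConverter, STANDARD_KEYS, the toy converters and their labels) are hypothetical and are not part of modelforge; the record values are placeholders, and the actual hdf5 schema is not defined in these notes.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List

# Hypothetical standardized property names (an assumption, not an agreed schema).
STANDARD_KEYS = ("atomic_numbers", "geometry", "energy")


class DatasetConverter(ABC):
    """Maps the records of one source dataset onto the shared property names."""

    # Maps each source-specific label to its standardized label.
    label_map: Dict[str, str] = {}

    @abstractmethod
    def read_records(self) -> Iterable[Dict[str, Any]]:
        """Yield raw records using the source dataset's own labels."""

    def convert(self) -> List[Dict[str, Any]]:
        converted = []
        for record in self.read_records():
            # Keep only properties with a known mapping and rename them.
            entry = {
                self.label_map[key]: value
                for key, value in record.items()
                if key in self.label_map
            }
            missing = [key for key in STANDARD_KEYS if key not in entry]
            if missing:
                raise KeyError(f"record is missing properties: {missing}")
            converted.append(entry)
        return converted


class ToyConverterA(DatasetConverter):
    """Toy source that uses short labels (placeholder values)."""

    label_map = {"Z": "atomic_numbers", "xyz": "geometry", "E": "energy"}

    def read_records(self):
        yield {"Z": [1], "xyz": [[0.0, 0.0, 0.0]], "E": -0.5}


class ToyConverterB(DatasetConverter):
    """Toy source with different labels for the same properties."""

    label_map = {
        "atomic_numbers": "atomic_numbers",
        "coordinates": "geometry",
        "total_energy": "energy",
    }

    def read_records(self):
        yield {
            "atomic_numbers": [1],
            "coordinates": [[0.0, 0.0, 0.0]],
            "total_energy": -0.5,
            "comment": "dropped, no standardized counterpart",
        }


records_a = ToyConverterA().convert()
records_b = ToyConverterB().convert()
```

In this sketch both toy sources end up with identical standardized records, so whatever writes the hdf5 file (for example with h5py) and whatever loads it afterwards would only need to know the shared property names.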

Issue: The scope of QCArchive for retrieving and storing datasets is not fully clear. As of August 2023, support for the old version has effectively stopped in favor of development of the new version (and site review); however, the new version ('next' branch) is still not quite ready for production. Problems we've encountered with QCArchive (and their solutions):