Issue: It would make sense to use pytorch-lightning to perform the training, ray to distribute training over compute instances, and dvc to control every aspect that should be reproducible (pytorch-lightning, ray, and dvc can be combined easily). The latter includes the training dataset. Ideally, we want to define a given ML experiment in one or more dvc.yaml files so it can be easily reproduced, but also so that any change to training/data processing is traceable. We have worked under the assumption that we should use QCArchive as our primary data API and upload an hdf5 file to zenodo for each dataset. An alternative would be dataset-registry from dvc. Should we focus on dvc as our primary storage hub for all the data needed for an ML experiment? Instead of relying on a mixture of zenodo/QCArchive, we could use dvc for model, parameter (and hyperparameter, including everything needed to reproduce the training), and dataset storage/management (I propose to use it for model and parameter storage in any case). QCArchive would then only be used for the active training pipeline to generate new data points.
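As a concrete sketch of this idea, a dvc.yaml for one experiment could declare the data-preparation and training stages with their dependencies, parameters, and outputs. Everything below (stage names, scripts, paths, parameter keys) is a hypothetical placeholder, not an agreed-upon layout:

```yaml
# Hypothetical dvc.yaml sketch; stage names, scripts, paths, and
# parameter keys are placeholders, not an agreed-upon project layout.
stages:
  prepare_data:
    cmd: python prepare_data.py --dataset qm9
    deps:
      - prepare_data.py
    params:
      - prepare.dataset_version
    outs:
      - data/qm9.hdf5
  train:
    cmd: python train.py --config configs/train.yaml
    deps:
      - train.py
      - data/qm9.hdf5
    params:
      - train.learning_rate
      - train.max_epochs
    outs:
      - models/checkpoint.ckpt
```

With stages declared like this, dvc repro re-runs only the stages whose dependencies or parameters changed, which gives exactly the traceability described above.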
What we decided: We decided to use QCArchive or other data sources for data, but then restructure this data into hdf5 files that we will upload to zenodo. These hdf5 files will serve as the main way to load data into the package. Since different datasets may have different properties, different formats, and potentially different labels for the same properties, we will need to create unique converters for each dataset; however, we should be able to largely standardize the hdf5 files we generate, so loading into modelforge should be similar for each dataset. Implementing individual classes (i.e., a custom child class of a shared dataset class) would resolve any subtle differences between data formats that may occur.
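A minimal sketch of this layering, assuming a standardized hdf5 layout with one group per molecule and dataset-specific property labels mapped onto standardized names (the class names, base class, and hdf5 keys are illustrative, not the actual modelforge API):

```python
# Sketch of a shared hdf5 loader plus a per-dataset child class.
# HDF5Dataset, the file layout, and the label names are assumptions
# made for illustration; they are not the actual modelforge API.
import h5py
import torch
from torch.utils.data import Dataset


class HDF5Dataset(Dataset):
    """Shared logic: load records from a standardized hdf5 file."""

    # Child classes override this to map dataset-specific labels
    # onto the standardized names used by the rest of the package.
    label_map: dict = {}

    def __init__(self, path: str):
        self.records = []
        with h5py.File(path, "r") as f:
            for name in f:  # assumed layout: one group per molecule
                group = f[name]
                record = {
                    self.label_map.get(key, key): torch.tensor(group[key][()])
                    for key in group
                }
                self.records.append(record)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        return self.records[idx]


class QM9Dataset(HDF5Dataset):
    """Per-dataset child class: only the label quirks differ."""

    label_map = {"internal_energy_at_0K": "energy"}
```

Each new dataset then only needs its own converter script (raw source to standardized hdf5) and, where necessary, a small child class like the one above.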
Issue: the scope of QCArchive for retrieving and storing datasets is not fully clear. Currently (August 2023), support for the old version has effectively stopped in favor of development of the new version (and site review); however, the new version ('next' branch) is still not quite ready for production. Problems we've encountered with QCArchive (and their solutions):
- Used concurrent.futures to perform multithreaded downloading (48 threads appeared optimal on Chris' 8-core machine), reducing download time to a more reasonable ~30 minutes.
- Used sqlitedict to store records in a persistent local database, so records that have already been downloaded are not refetched (both workarounds are combined in the sketch below).
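A sketch combining both workarounds: fetch_record is a hypothetical stand-in for the actual qcportal call that retrieves one record (we are not reproducing the QCArchive client API here), while the thread pool and the persistent cache use the real concurrent.futures and sqlitedict interfaces.

```python
# Thread-pooled downloads plus a persistent sqlitedict cache, so that
# already-fetched records are never downloaded twice. fetch_record is a
# hypothetical placeholder for the real QCArchive/qcportal retrieval call.
from concurrent.futures import ThreadPoolExecutor, as_completed

from sqlitedict import SqliteDict


def fetch_record(record_id):
    """Placeholder: fetch a single record from QCArchive."""
    raise NotImplementedError


def fetch_all(record_ids, cache_path="records.sqlite", n_threads=48):
    # autocommit=True flushes each record to disk as soon as it is cached,
    # so an interrupted download can resume without refetching anything.
    with SqliteDict(cache_path, autocommit=True) as cache:
        missing = [rid for rid in record_ids if str(rid) not in cache]
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            futures = {pool.submit(fetch_record, rid): rid for rid in missing}
            for future in as_completed(futures):
                cache[str(futures[future])] = future.result()
        return {rid: cache[str(rid)] for rid in record_ids}
```

Because the downloads are I/O-bound, far more threads than physical cores pays off, which is consistent with 48 threads being optimal on an 8-core machine.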