
Notes / Discussion points

Issue: It would make sense to use pytorch-lightning to perform the training, ray to distribute training over compute instances, and dvc to control every aspect that should be reproducible, including the training dataset (pytorch-lightning, ray, and dvc can be easily combined). Ideally, we want to define a given ML experiment in one or more dvc.yaml files so that it can be easily reproduced and so that any change to training or data processing is traceable. So far we have worked under the assumption that QCArchive is our primary data API and that we upload an hdf5 file to zenodo for each dataset. An alternative would be the dataset-registry approach from dvc. Should we focus on dvc as our primary storage hub for all the data needed for an ML experiment? Instead of a mixture of zenodo and QCArchive, we could use dvc to store and manage models, parameters (and hyperparameters, including everything needed to reproduce the training), and datasets (I propose to use it for model and parameter storage in any case). QCArchive would then only be used in the active training pipeline to generate new data points.

What we decided: We will use QCArchive or other data sources for data, but restructure this data into hdf5 files that we upload to zenodo. These hdf5 files will serve as the main way to load data into the package. Since different datasets may have different properties, different formats, and potentially different labels for the same properties, we will need a dedicated converter for each dataset. We should nevertheless be able to mostly standardize the hdf5 files we generate, so loading into modelforge should be similar for each dataset. Implementing individual classes (i.e., a custom child class of dataset) would resolve any subtle differences between data formats that may occur.
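A minimal sketch of the per-dataset converter idea, assuming a shared base class that enforces a standardized set of property names. All class names, the standardized keys, and the source labels (`Z`, `R`, `U0`) are hypothetical and chosen only for illustration; writing the converted records to an hdf5 file (e.g., with h5py) is left out.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List


class DatasetConverter(ABC):
    """Base class: maps dataset-specific records onto one shared schema."""

    # Hypothetical standardized property names for the generated hdf5 files.
    STANDARD_KEYS = ("atomic_numbers", "geometry", "energy")

    @abstractmethod
    def convert_record(self, record: Dict[str, Any]) -> Dict[str, Any]:
        """Return one record that uses the standardized property names."""

    def convert(self, records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
        converted = []
        for record in records:
            out = self.convert_record(record)
            # Fail early if a converter forgets a standardized property.
            missing = [key for key in self.STANDARD_KEYS if key not in out]
            if missing:
                raise KeyError(f"converted record is missing {missing}")
            converted.append(out)
        return converted


class QM9Converter(DatasetConverter):
    """Illustrative child class; the source labels below are invented."""

    LABEL_MAP = {"Z": "atomic_numbers", "R": "geometry", "U0": "energy"}

    def convert_record(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Rename each dataset-specific label to its standardized name.
        return {new: record[old] for old, new in self.LABEL_MAP.items()}


# Usage with a single made-up record.
raw = [{"Z": [8, 1, 1], "R": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.96], [0.93, 0.0, -0.24]], "U0": -76.4}]
standardized = QM9Converter().convert(raw)
```

Keeping the label mapping inside each child class means dataset-specific quirks stay local, while everything downstream only sees the standardized names.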

Issue: The scope of QCArchive for retrieving and storing datasets is not fully clear. Currently (August 2023), support for the old version has effectively stopped in favor of development of the new version (and site review); however, the new version (the 'next' branch) is still not quite ready for production. Problems we've encountered with QCArchive (and their solutions):

The next branch does not provide an hdf5 dataview (as was available for some datasets in the old version). Datasets must be retrieved record by record, which is slow: it took ~15 hours to fetch the QM9 dataset of ~135K records.
- Sped this up by using concurrent.futures for multithreading (48 threads appeared optimal on Chris' 8-core machine), reducing the download time to a more reasonable 30 minutes.
  - Todo: discuss with Ben plans/options during the weekly QCArchive sync to improve the ability to download whole datasets.
We frequently lose the connection to QCPortal while fetching records.
- Implemented code to reconnect automatically after a failure. We use sqlitedict to store records in a persistent local database, which avoids refetching records that have already been downloaded.
  - Todo: discuss with Ben plans/options during the weekly QCArchive sync for local storage (relates to the point above as well, re: hdf5). Note that DBM is implemented in the code (although not in the conda release yet) to avoid refetching records that have already been downloaded within a Python kernel, but this appears to be a temporary cache, not a local storage solution.
In the next branch, the SPICE datasets do not seem to include the actual QM data, only the molecule overview.
- Todo: discuss during QCArchive meeting
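
The first two workarounds above (multithreaded fetching, automatic retry, and a persistent local store) can be sketched roughly as follows. This is not the actual implementation: `fetch_one` is a hypothetical callable standing in for whatever QCPortal call retrieves a single record, the stdlib `sqlite3` module is used here as a stand-in for sqlitedict, and the function names, retry count, and delay are assumptions.

```python
import pickle
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from contextlib import closing
from typing import Any, Callable, Dict, Iterable


def fetch_with_retry(fetch_one: Callable[[Any], Any], record_id: Any,
                     retries: int = 3, delay: float = 1.0) -> Any:
    """Call fetch_one(record_id), retrying when the connection drops."""
    for attempt in range(1, retries + 1):
        try:
            return fetch_one(record_id)
        except ConnectionError:
            if attempt == retries:
                raise
            time.sleep(delay)  # brief pause before trying again


def fetch_all(fetch_one: Callable[[Any], Any], record_ids: Iterable[Any],
              db_path: str, max_workers: int = 48,
              retries: int = 3, delay: float = 1.0) -> Dict[str, Any]:
    """Fetch records in parallel, skipping any already stored in db_path."""
    ids = [str(rid) for rid in record_ids]
    originals = dict(zip(ids, record_ids if isinstance(record_ids, list) else ids))
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, data BLOB)"
        )
        cached = {row[0] for row in conn.execute("SELECT id FROM records")}
        todo = [rid for rid in ids if rid not in cached]

        # Worker threads only fetch; the main thread does all database writes,
        # so a single sqlite3 connection is sufficient.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {
                pool.submit(fetch_with_retry, fetch_one, originals[rid],
                            retries, delay): rid
                for rid in todo
            }
            for future in as_completed(futures):
                conn.execute(
                    "INSERT OR REPLACE INTO records VALUES (?, ?)",
                    (futures[future], pickle.dumps(future.result())),
                )
                conn.commit()  # persist each record as soon as it arrives

        rows = conn.execute("SELECT id, data FROM records")
        wanted = set(ids)
        return {rid: pickle.loads(data) for rid, data in rows if rid in wanted}
```

Because each record is committed as soon as it arrives, rerunning `fetch_all` with the same `db_path` after a crash or disconnect only fetches the records that are still missing. Pass the record IDs as a list so that the original (non-string) IDs are handed to `fetch_one`.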