marcel.science / repbench
Benchmarking representations of atomistic systems for machine learning

Hello! 👋 This website is a supplement to our paper (see below), in which we review, compare, and benchmark representations of molecules and materials, in the context of using machine learning to interpolate ab initio electronic structure calculations. Here, we present the software infrastructure, datasets, and results required to reproduce, interrogate, or (hopefully) build on our results. Enjoy!

Paper

Title: Representations of molecules and materials for interpolation of quantum-mechanical simulations via machine learning
Authors: Marcel F. Langer, Alex Goeßmann, Matthias Rupp
Published: npj Computational Materials 8, 41 (2022)
DOI: 10.1038/s41524-022-00721-x
Preprint: arxiv.org/abs/2003.12081 (2020)

Please contact Matthias for general questions about the paper. To chat about the benchmark in particular, please get in touch with Marcel at langer@fhi-berlin.mpg.de or 🐦 marceldotsci.

Results

repbench-results: The results presented in the paper: optimised models, learning curves, timings, in yaml format. Also includes the search spaces in which the optimised models were found, and plots.

repbench-project: The (lightly edited) repbench project repository. Intended for transparency and reproducibility, but not recommended as a starting point for further use. Includes all infrastructure and provenance information, and large amounts of additional data, for instance all optimisation steps for all reruns of the hyper-parameter optimiser, and results for variant models and model implementations.

Software

The repbench infrastructure is built on cmlkit, extended through a number of specialised plugins.

cmlkit: Extensible Python package providing clean and concise infrastructure to specify, tune, and evaluate machine learning models for computational chemistry and condensed matter physics. Also provides interfaces to RuNNer and quippy.
cscribe: Plugin to interface with dscribe (for the benchmark, we only use the SOAP implementation).
skrrt: Plugin for tuning of kernel ridge regression (KRR) hyper-parameters with cross-validation.
mortimer: Plugin for timings.
qmmlpack: Basis for cmlkit, providing low-level implementation of KRR, atomic kernels, and MBTR.
cmlkit tutorial: Introduction and "guided tour" of cmlkit, provided as part of the NOMAD Analytics Toolkit.

Using these tools, the repbench infrastructure then implements a minimal command line interface (falco) and additional custom components. For the paper, this infrastructure was maintained alongside the results in a project repository. In order to enable further, separate work, we will release a standalone version of the infrastructure with an example project repository soon.

Datasets

repbench-datasets: Repository with the datasets used in repbench, including all splits (inner splits for HP optimisation, outer train/test splits for evaluations), in cmlkit format. The scripts used to convert from the original sources are also provided.

Summary

In the context of this work, a representation is a vector describing a system (a molecule or material, i.e. a periodic crystal), which is computed based only on: 1) the positions of the atoms, 2) their atomic numbers, and 3) if applicable, the crystal basis vectors. Such representations are used as features for regression methods, where we try to interpolate between the results of ab initio calculations (here: DFT), which provide the “ground truth”. Representations are used because the “raw” coordinates do not reflect the intrinsic symmetries of the physics we’re modelling (for instance, they’re not rotationally or translationally invariant) and because they’re somewhat awkward to work with technically.
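To make the invariance argument concrete, here is a minimal sketch in Python with NumPy. It is not one of the benchmarked representations: `sorted_pairwise_distances` is a hypothetical toy descriptor that uses only interatomic distances and ignores atomic numbers, and the coordinates are made up for illustration. It shows that rotating and translating a structure changes the raw coordinates but leaves the distance-based descriptor unchanged.

```python
import numpy as np


def sorted_pairwise_distances(positions):
    """Toy descriptor (assumption, not SOAP, SF, or MBTR): sorted interatomic distances."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(positions), k=1)  # each atom pair once
    return np.sort(dists[upper])


def rotation_about_z(angle):
    """Rotation matrix for a rotation by `angle` (radians) about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Made-up three-atom geometry (illustrative values only).
positions = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])

# Same structure, rotated and shifted: physically identical.
moved = positions @ rotation_about_z(0.7).T + np.array([1.0, -2.0, 0.5])

raw_changed = not np.allclose(positions, moved)
descriptor_unchanged = np.allclose(
    sorted_pairwise_distances(positions), sorted_pairwise_distances(moved)
)
print(raw_changed, descriptor_unchanged)  # True True
```

A regression model fed the raw coordinates would have to learn that `positions` and `moved` have the same energy, whereas an invariant representation builds this in by construction.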

We compare representations of this type from a conceptual and empirical standpoint. On the conceptual level, we show which techniques are commonly used to construct representations, and present (building on other work) a basic mathematical formulation of these techniques. On the empirical level, in the representation benchmark (repbench), we compare three representations: Smooth Overlap of Atomic Positions (SOAP), Symmetry Functions (SF) and the Many-Body Tensor Representation (MBTR). For all three, we use the same regression method (KRR) and a consistent, automatic way to tune the hyper-parameters. We also use the same stratified splits for all datasets, both for HP tuning and for evaluating the final results. Specifically, we evaluate the prediction errors for energies in molecules (QM9 dataset) and materials (BA10 and NMD18), at different training set sizes, generating learning curves. We also time how long it takes to compute representations and kernel matrices for each.
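As a rough illustration of what a learning curve with KRR looks like, the following sketch uses plain NumPy rather than the cmlkit/qmmlpack infrastructure. Everything in it is an assumption made for illustration: the synthetic target function stands in for DFT energies, the random 2D inputs stand in for representations, the Gaussian kernel and the fixed `length_scale` and `regularisation` values are arbitrary (the benchmark tunes hyper-parameters automatically), and the split is a plain slice rather than a stratified one.

```python
import numpy as np


def gaussian_kernel(a, b, length_scale):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * length_scale**2))


def krr_fit_predict(x_train, y_train, x_test, length_scale=1.0, regularisation=1e-6):
    """Kernel ridge regression: solve (K + lambda * I) alpha = y, then predict."""
    kernel = gaussian_kernel(x_train, x_train, length_scale)
    alpha = np.linalg.solve(kernel + regularisation * np.eye(len(x_train)), y_train)
    return gaussian_kernel(x_test, x_train, length_scale) @ alpha


# Synthetic stand-in data (assumption): smooth function of 2D "features".
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=(600, 2))
y = np.sin(x[:, 0]) * np.cos(x[:, 1])
x_test, y_test = x[500:], y[500:]  # held-out test set, never used for training

# Learning curve: test RMSE as a function of training set size.
learning_curve = {}
for n_train in (25, 100, 400):
    prediction = krr_fit_predict(x[:n_train], y[:n_train], x_test)
    learning_curve[n_train] = float(np.sqrt(np.mean((prediction - y_test) ** 2)))
    print(n_train, learning_curve[n_train])
```

Plotting the resulting errors against training set size on log-log axes gives the usual learning-curve picture; in the benchmark, the same procedure is repeated per representation and dataset with tuned hyper-parameters.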

We find the following:
1. Across representations and datasets, going from k=2 (meaning representations that are built from functions that take the positions of two atoms into account, like a distance) to k=2,3 (adding three-atom information, like angles) leads to lower prediction errors.
2. Errors are, mostly, lowest for SOAP, then SF, then MBTR. This might be because local representations (which generate one representation per atom) are superior to global ones (which describe the whole structure at once), but we only have one global representation, so it might be an issue with the MBTR in particular.
3. Generally, the computational cost of computing representations and kernel matrices can be traded off against prediction errors, with multiple Pareto-optimal choices existing depending on the target accuracy.
4. When using unrelaxed structures but relaxed energies, the “noise” induced by relaxation dominates the difference between representations.
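The trade-off between cost and error in finding 3 can be read as a Pareto front: a model is Pareto-optimal if no other model is at least as good in both cost and error and strictly better in one. The sketch below illustrates that definition only. The model names and numbers are invented placeholders, not results from the paper, and `pareto_front` is a hypothetical helper rather than part of the repbench infrastructure.

```python
def pareto_front(models):
    """Return names of models not dominated in (cost, error); lower is better for both."""
    front = []
    for name, cost, error in models:
        dominated = any(
            (c <= cost and e <= error) and (c < cost or e < error)
            for _, c, e in models
        )
        if not dominated:
            front.append(name)
    return front


# Invented placeholder values: (name, compute cost, prediction error).
candidates = [
    ("model_a", 1.0, 0.50),
    ("model_b", 2.0, 0.30),
    ("model_c", 3.0, 0.35),  # dominated by model_b: more expensive and less accurate
    ("model_d", 8.0, 0.10),
]

front = pareto_front(candidates)
print(front)  # ['model_a', 'model_b', 'model_d']
```

Which point on the front to pick then depends on the target accuracy: a cheap, less accurate model can be the right choice if its error is already below what the application needs.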