# marcel.science / repbench

Benchmarking representations of atomistic systems for machine learning

Hello! 👋 This website is a supplement to our paper (see below), in which we review, compare, and benchmark representations of molecules and materials in the context of using machine learning to interpolate *ab initio* electronic structure calculations. Here, we present the software infrastructure, datasets, and results required to reproduce, interrogate, or (hopefully) build on our work. Enjoy!

Items marked with TODO are not yet ready; please check back soon!

### Paper

- Title: Representations of molecules and materials for interpolation of quantum-mechanical simulations via machine learning
- Preprint: arxiv.org/abs/2003.12081 (March 2020)

Please contact Matthias for general questions about the paper. To chat about the benchmark in particular, please get in touch with Marcel at langer@fhi-berlin.mpg.de or 🐦 marceldotsci.

### Results

`repbench-results`
: The results presented in the paper: optimised models, learning curves, and timings, in `yaml` format. Also includes the *search spaces* in which the optimised models were found, and plots. (TODO)

`repbench-project`
: The (lightly edited) `repbench` project repository. Intended for transparency and reproducibility, but *not* recommended as a starting point for further use. Includes all infrastructure and provenance information, and large amounts of additional data, for instance all optimisation steps for all reruns of the hyper-parameter optimiser, and results for variant models and model implementations.

### Software

The `repbench` infrastructure is built on `cmlkit`, extended through a number of specialised plugins.

`cmlkit`
: Extensible Python package providing clean and concise infrastructure to specify, tune, and evaluate machine learning models for computational chemistry and condensed-matter physics. Also provides interfaces to `RuNNer` and `quippy`.

`cscribe`
: Plugin to interface with `dscribe` (for the benchmark, we only use the SOAP implementation).

`skrrt`
: Plugin for tuning kernel ridge regression (KRR) hyper-parameters with cross-validation.

`mortimer`
: Plugin for timings.

`qmmlpack`
: Basis for `cmlkit`, providing low-level implementations of KRR, atomic kernels, and the MBTR.

`cmlkit` tutorial
: Introduction and "guided tour" of `cmlkit`, provided as part of the NOMAD Analytics Toolkit.

Using these tools, the `repbench` infrastructure implements a minimal command-line interface (`falco`) and additional custom components. For the paper, this infrastructure was maintained alongside the results in a project repository. To enable further, separate work, we will soon release a standalone version of the infrastructure with an example project repository.

### Datasets

`repbench-datasets`
: Repository with the datasets used in `repbench`, including all splits (inner splits for hyper-parameter optimisation, outer train/test splits for evaluation), in `cmlkit` format. The scripts used to convert from the original sources are also provided.
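To make the split structure concrete, here is a toy sketch of one outer train/test split plus inner train/validation splits for hyper-parameter optimisation. This is a hypothetical illustration only: the function names, the crude "sort and take evenly spaced items" stratification, and all sizes are invented here and are not the actual `repbench-datasets` format.

```python
import numpy as np

rng = np.random.default_rng(42)

def stratified_split(values, n_test):
    """Toy stratified split: sort by target value and take evenly spaced
    items as the test set, so train and test cover the same value range.
    (A hypothetical sketch, not the actual repbench split format.)"""
    order = np.argsort(values)
    step = max(len(values) // n_test, 1)
    test = order[::step][:n_test]
    train = np.setdiff1d(order, test)
    return train, test

# Outer train/test split on a toy "energy" column
energies = rng.normal(size=100)
train, test = stratified_split(energies, n_test=20)

# Inner splits for hyper-parameter optimisation, drawn from the outer train set
inner = []
for _ in range(3):
    sub = rng.permutation(train)
    inner.append((sub[16:], sub[:16]))  # (inner-train, inner-validation)

print(len(train), len(test), len(inner))  # 80 20 3
```

The key invariant, which the real splits also maintain, is that the test set is never seen during hyper-parameter optimisation: all inner splits are drawn exclusively from the outer training set.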

### Tooling

TODO `repbench-skeleton`
: Repository with a streamlined version of the `repbench` infrastructure, provided as a starting point for future experiments with additional representations.

### Summary

In the context of this work, a *representation* is a vector describing a *system* (a molecule or material, i.e. a periodic crystal), computed based only on: 1) the positions of the atoms, 2) their atomic numbers, and 3), if applicable, the crystal basis vectors. Such representations are used as *features* for regression methods, with which we try to interpolate between the results of *ab initio* calculations (here: DFT), which provide the “ground truth”. Representations are used because the “raw” coordinates do not reflect the intrinsic symmetries of the physics we’re modelling (for instance, they are not rotationally or translationally invariant), and because they are somewhat awkward to work with technically.
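As a toy illustration of these invariances, the sketch below builds a deliberately simple *k=2* representation: a histogram of interatomic distances, ignoring atomic numbers and periodicity. The function name and all parameters are invented for this example; this is not the `repbench` or `cmlkit` implementation.

```python
import numpy as np

def pair_distance_histogram(positions, bins=10, r_max=5.0):
    """Toy k=2 representation: histogram of interatomic distances.

    A hypothetical illustration only; ignores atomic numbers and
    periodic boundary conditions for simplicity.
    """
    n = len(positions)
    # all pairwise distances between distinct atoms
    dists = [np.linalg.norm(positions[i] - positions[j])
             for i in range(n) for j in range(i + 1, n)]
    hist, _ = np.histogram(dists, bins=bins, range=(0.0, r_max))
    return hist

# A toy 3-atom "molecule"
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.5, 0.0]])

# Rotate by 90 degrees about z and translate: the raw coordinates change,
# but the distance-based representation does not.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = pos @ R.T + np.array([2.0, -1.0, 0.5])

print(np.allclose(pair_distance_histogram(pos), pair_distance_histogram(moved)))
```

Real representations like SOAP, SF, and the MBTR achieve the same invariances with far richer (and, for *k=2,3*, angle-resolved) descriptions of the local geometry.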

We compare representations of this type from a conceptual and an empirical standpoint. On the conceptual level, we survey the techniques commonly used to construct representations and present (building on other work) a basic mathematical formulation of these techniques. On the empirical level, the *representation benchmark* (`repbench`), we compare three representations: Smooth Overlap of Atomic Positions (SOAP), Symmetry Functions (SF), and the Many-Body Tensor Representation (MBTR). For all three, we use the same regression method (KRR) and a consistent, automatic way of tuning the hyper-parameters. We also use the same stratified splits for all datasets, both for hyper-parameter tuning and for evaluating the final results. Specifically, we evaluate the prediction errors for energies in molecules (QM9 dataset) and materials (BA10 and NMD18) at different training set sizes, generating learning curves. We also time how long it takes to compute representations and kernel matrices for each.

We find the following:
1. Across representations and datasets, going from *k=2* (meaning, representations that are built from functions that take the positions of two atoms into account, like a distance) to *k=2,3* (adding three-atom information, like angles) leads to lower prediction errors.
2. Prediction errors are, in most cases, lowest for SOAP, followed by SF and then MBTR. This might be because local representations (which generate one representation per atom) are superior to global ones (which describe the whole structure at once), but since we only have one global representation, it might also be an issue with the MBTR in particular.
3. Generally, the computational cost of computing representations and kernel matrices can be traded off against prediction errors, with multiple Pareto-optimal choices existing depending on the target accuracy.
4. When using unrelaxed structures but relaxed energies, the “noise” induced by relaxation dominates the difference between representations.