repbench
Benchmarking representations of atomistic systems for machine learning

Hello! 👋 This website is a supplement to our paper (see below), in which we review, compare, and benchmark representations of molecules and materials, in the context of using machine learning to interpolate ab initio electronic structure calculations. Here, we present the software infrastructure, datasets, and results required to reproduce, interrogate, or (hopefully) build on our results. Enjoy!

Items marked with TODO are not yet ready; please check back soon!

Paper

Please contact Matthias for general questions about the paper. To chat about the benchmark in particular, please get in touch with Marcel at langer@fhi-berlin.mpg.de or 🐦 marceldotsci.

Results

Software

The repbench infrastructure is built on cmlkit, extended through a number of specialised plugins.

Using these tools, the repbench infrastructure then implements a minimal command line interface (falco) and additional custom components. For the paper, this infrastructure was maintained alongside the results in a project repository. In order to enable further, separate work, we will release a standalone version of the infrastructure with an example project repository soon.

Datasets

repbench-datasets: Repository with the datasets used in repbench, including all splits (inner splits for HP optimisation, outer train/test splits for evaluations), in cmlkit format. The scripts used to convert from the original sources are also provided.

Tooling

TODO repbench-skeleton: Repository with a streamlined version of the repbench infrastructure, provided as a starting point for future experiments with additional representations.

Summary

In the context of this work, a representation is a vector describing a system (a molecule or a material, i.e. a periodic crystal), computed only from: 1) the positions of the atoms, 2) their atomic numbers, and 3) if applicable, the crystal basis vectors. Such representations are used as features for regression methods, where we try to interpolate between the results of ab initio calculations (here: DFT), which provide the “ground truth”. Representations are used because the “raw” coordinates do not reflect the intrinsic symmetries of the physics we’re modelling (for instance, they’re not rotationally or translationally invariant), and because they’re somewhat awkward to work with technically.
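To make the symmetry argument concrete, here is a minimal sketch in plain Python. It is not one of the representations benchmarked here: the sorted list of interatomic distances is only a toy k=2 descriptor that ignores atomic numbers, and the geometry and helper names (`rotate_z`, `translate`, `sorted_distances`) are illustrative assumptions. It shows that rotating and translating a structure changes the raw coordinates but leaves a distance-based descriptor unchanged.

```python
import math
from itertools import combinations


def rotate_z(positions, angle):
    """Rotate every position about the z axis by `angle` (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in positions]


def translate(positions, shift):
    """Shift every position by the same vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in positions]


def sorted_distances(positions):
    """Toy k=2 descriptor: all interatomic distances, sorted.

    Sorting also makes the result independent of atom ordering.
    Atomic numbers are deliberately ignored to keep the sketch short.
    """
    return sorted(math.dist(a, b) for a, b in combinations(positions, 2))


# Water-like geometry in angstrom (illustrative values, not from a dataset).
water = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
moved = translate(rotate_z(water, 0.7), (1.0, -2.0, 0.5))

# The raw coordinates differ after the rigid motion ...
raw_changed = any(
    not math.isclose(a, b, abs_tol=1e-9)
    for p, q in zip(water, moved)
    for a, b in zip(p, q)
)

# ... but the distance-based descriptor is the same up to rounding.
same_descriptor = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(sorted_distances(water), sorted_distances(moved))
)

print(raw_changed, same_descriptor)
```

A regression model trained on the raw coordinates would have to learn this invariance from data, whereas a model trained on the descriptor gets it for free.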

We compare representations of this type from a conceptual and an empirical standpoint. On the conceptual level, we show which techniques are commonly used to construct representations, and present (building on other work) a basic mathematical formulation of these techniques. On the empirical level, in the representation benchmark (repbench), we compare three representations: Smooth Overlap of Atomic Positions (SOAP), Symmetry Functions (SF) and the Many-Body Tensor Representation (MBTR). For all three, we use the same regression method, kernel ridge regression (KRR), and a consistent, automatic way to tune the hyper-parameters (HPs). For each dataset, we also use the same stratified splits across representations, both for HP tuning and to evaluate the final results. Specifically, we evaluate the prediction errors for energies in molecules (QM9 dataset) and materials (BA10 and NMD18) at different training set sizes, generating learning curves. We also time how long it takes to compute representations and kernel matrices for each.
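As a rough illustration of what a learning curve with KRR looks like, the following sketch fits a one-dimensional toy function at increasing training set sizes and records the test RMSE. Everything in it is an assumption made for illustration: the target function, the fixed values of `sigma` and `reg` (the benchmark tunes hyper-parameters automatically instead), and the small pure-Python linear solver, which stands in for a numerical library. It is not the repbench or cmlkit implementation.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel between two scalar features."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def krr_fit(x_train, y_train, sigma, reg):
    """Return regression weights alpha solving (K + reg * I) alpha = y."""
    kernel = [
        [gaussian_kernel(a, b, sigma) + (reg if i == j else 0.0)
         for j, b in enumerate(x_train)]
        for i, a in enumerate(x_train)
    ]
    return solve(kernel, y_train)


def krr_predict(x_train, alpha, x_new, sigma):
    """Predict as a weighted sum of kernels centred on the training points."""
    return [
        sum(w * gaussian_kernel(x, t, sigma) for w, t in zip(alpha, x_train))
        for x in x_new
    ]


def rmse(predicted, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true))


# Toy "ground truth" standing in for an ab initio energy.
target = math.sin
sigma, reg = 1.0, 1e-6  # fixed here; tuned automatically in the benchmark

x_test = [0.05 + 0.1 * i for i in range(60)]
y_test = [target(x) for x in x_test]

learning_curve = {}
for n_train in (5, 10, 20, 40):
    x_train = [6.0 * i / (n_train - 1) for i in range(n_train)]
    y_train = [target(x) for x in x_train]
    alpha = krr_fit(x_train, y_train, sigma, reg)
    learning_curve[n_train] = rmse(krr_predict(x_train, alpha, x_test, sigma), y_test)

for n_train, error in learning_curve.items():
    print(n_train, error)
```

Plotting the error against the training set size on logarithmic axes gives the usual learning-curve picture. In the benchmark, the scalar feature is replaced by a representation and the toy target by DFT energies.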

We find the following:
1. Across representations and datasets, going from k=2 (representations built from functions that take the positions of two atoms into account, like a distance) to k=2,3 (adding three-atom information, like angles) leads to lower prediction errors.
2. Errors are mostly lowest for SOAP, then SF, then MBTR. This might be because local representations (which generate one representation per atom) are superior to global ones (which describe the whole structure at once), but we only have one global representation, so it might be an issue with the MBTR in particular.
3. Generally, the computational cost of computing representations and kernel matrices can be traded off against prediction errors, with multiple Pareto-optimal choices existing depending on the target accuracy.
4. When using unrelaxed structures but relaxed energies, the “noise” induced by relaxation dominates the difference between representations.
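The cost/error trade-off in finding 3 can be illustrated with a short sketch that selects the Pareto-optimal entries from a list of (cost, error) pairs. The function name `pareto_front`, the model labels, and all numbers below are hypothetical and chosen only to show the mechanics; they are not results from the paper.

```python
def pareto_front(models):
    """Return names of models not dominated in (cost, error).

    Lower is better for both quantities. A model is dominated if another
    model is at least as good in both and strictly better in at least one.
    """
    front = []
    for name, cost, error in models:
        dominated = any(
            (c <= cost and e <= error) and (c < cost or e < error)
            for _, c, e in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, compute time in seconds, prediction error) entries.
candidates = [
    ("model_a", 1.0, 10.0),  # cheapest, least accurate
    ("model_b", 5.0, 4.0),
    ("model_c", 6.0, 6.0),   # slower and less accurate than model_b
    ("model_d", 20.0, 2.0),  # most expensive, most accurate
]

front = pareto_front(candidates)
print(front)  # ['model_a', 'model_b', 'model_d']
```

Which point on the front is preferable depends on the target accuracy: a loose error tolerance favours the cheap end, a tight one the expensive end.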