30–31 Oct 2025
IT4Innovations
Europe/Prague timezone

Use of open data for training machine-learning interatomic potentials

30 Oct 2025, 18:38
1m
atrium (IT4Innovations)


Studentská 6231/1B 708 00 Ostrava-Poruba
Poster
Materials Science (e.g. Computational/Theoretical/Physical Chemistry, Soft Matter, Polymer Research)
Conference Dinner and Poster Session

Speaker

Šimon Kratochvíl

Description

As machine learning rapidly gains popularity in science, the need for high-quality training data is increasing. Most researchers, however, train their models either on their own data or on curated databases. With the growing emphasis on open science, a large amount of data from other researchers is now openly available, but such data often come without any guarantee of quality, so their suitability for machine learning is uncertain.

In this work, we assess the quality of the data in NOMAD, the largest open materials simulation database, and its practical applicability for training machine-learning interatomic potentials for atomistic simulations. We present a workflow designed to tackle several challenges associated with the NOMAD data: automatically filtering out results with low numerical accuracy, deduplicating structures in the training data, and combining results coming from multiple DFT implementations with different total energy offsets.
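The three steps named above (accuracy filtering, deduplication, and alignment of energy offsets between DFT codes) can be illustrated with a minimal sketch. This is not the authors' workflow and does not use the NOMAD API. The record fields (`code`, `positions`, `energy`, `kpoint_density`, `cutoff_ev`), the thresholds, the distance-based fingerprint, and the choice of the lowest energy per atom as each code's reference are all assumptions made for illustration, and the sketch assumes a single-element system.

```python
from itertools import combinations
from math import dist


def is_accurate(entry, min_kpoint_density=4.0, min_cutoff_ev=300.0):
    """Keep entries whose convergence settings pass the thresholds.

    Field names and threshold values are hypothetical placeholders.
    """
    return (entry["kpoint_density"] >= min_kpoint_density
            and entry["cutoff_ev"] >= min_cutoff_ev)


def fingerprint(entry, decimals=2):
    """Crude structure fingerprint: sorted, rounded interatomic distances.

    Assumption: periodic images and lattice vectors are ignored here;
    a real workflow would need a periodicity-aware descriptor.
    """
    distances = sorted(
        round(dist(a, b), decimals)
        for a, b in combinations(entry["positions"], 2)
    )
    return (len(entry["positions"]), tuple(distances))


def deduplicate(entries):
    """Keep the first entry seen for each structure fingerprint."""
    seen = set()
    unique = []
    for entry in entries:
        key = fingerprint(entry)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


def align_energies(entries):
    """Shift energies per atom so each code's lowest value becomes zero.

    Assumption: every code has sampled a comparable lowest-energy
    structure, so the per-code minimum is a usable common reference.
    """
    reference = {}
    for entry in entries:
        per_atom = entry["energy"] / len(entry["positions"])
        code = entry["code"]
        reference[code] = min(reference.get(code, per_atom), per_atom)
    return [
        dict(
            entry,
            energy_per_atom=entry["energy"] / len(entry["positions"])
            - reference[entry["code"]],
        )
        for entry in entries
    ]


def build_training_set(entries):
    """Filter, deduplicate, then align energies across codes."""
    accurate = [entry for entry in entries if is_accurate(entry)]
    return align_energies(deduplicate(accurate))
```

Calling `build_training_set` on a list of such records returns the surviving entries, each with an added `energy_per_atom` value that is comparable across codes under the stated assumption.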

With this workflow, we have successfully trained silicon-based potentials using only simulations from NOMAD as training data. The resulting potentials predict phase stability at a level comparable to state-of-the-art potentials, while also accurately describing large-scale atomic systems, even at high temperatures. This demonstrates that using open data can significantly reduce the time and cost required to generate suitable training datasets for machine-learning interatomic potentials.

Primary author

Šimon Kratochvíl

Co-author

Pavel Ondračka (MUNI)
