With machine learning and its use in science rapidly growing in popularity, the need for high-quality training data is increasing. However, most researchers train their models either on their own data or on curated databases. With the growing emphasis on open science, a large amount of data from other researchers is now openly available, but such data often come without any guarantee of quality, so their suitability for machine learning is uncertain.
In this work, we assess the quality of the data in NOMAD, the largest open materials simulation database, and its practical applicability for training machine-learning interatomic potentials for atomistic simulations. We present a workflow designed to tackle several challenges associated with the NOMAD data: automatically filtering out results with low numerical accuracy, deduplicating structures in the training data, and combining results from multiple DFT implementations with different total-energy offsets.
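The last step, reconciling total-energy offsets between DFT codes, can be illustrated with a common technique: fitting per-code atomic reference energies by least squares and subtracting them, so that energies from different codes become comparable. The sketch below uses synthetic pure-silicon data with an arbitrary 2.0 eV/atom offset between two hypothetical codes; it is not the actual NOMAD workflow, only an assumption-laden toy model of the idea.

```python
import numpy as np

# Toy example: total energies of pure-Si structures from two hypothetical
# DFT codes, where "code B" carries a constant per-atom offset of 2.0 eV
# (values chosen arbitrarily for illustration, not taken from NOMAD).
n_atoms = np.array([2, 8, 64, 2, 8, 64], dtype=float)  # atoms per structure
code = np.array([0, 0, 0, 1, 1, 1])                    # 0 = code A, 1 = code B
e_per_atom = -4.6                                      # assumed common slope (eV/atom)
offset = np.where(code == 0, 0.0, 2.0)                 # per-atom offset of code B
energies = (e_per_atom + offset) * n_atoms             # total energies (eV)

# Design matrix: one atomic-reference-energy column per code; for a
# multi-element dataset this would have one column per (code, element) pair.
X = np.zeros((len(n_atoms), 2))
X[np.arange(len(n_atoms)), code] = n_atoms

# Least-squares fit of the per-code atomic references, then subtract them.
mu, *_ = np.linalg.lstsq(X, energies, rcond=None)
aligned = energies - X @ mu  # energies relative to fitted references

print(mu[1] - mu[0])  # recovered inter-code offset, close to 2.0 eV/atom
```

For real data the aligned quantity is a formation-like energy per structure, and the residuals carry the physics that the potential is trained on; here, with a single element and an exact linear model, the residuals vanish by construction.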
With this workflow, we have successfully trained silicon-based potentials using only simulations from NOMAD as training data. The resulting potential predicts phase stability at a level comparable to state-of-the-art potentials, while also accurately describing large-scale atomic systems, even at high temperatures. This demonstrates that using open data can significantly reduce the time and cost of generating suitable training datasets for machine-learning interatomic potentials.