5th Users' Conference of IT4Innovations

Name: 5th Users' Conference of IT4Innovations
Start: 2021-11-09T09:00:00+01:00
End: 2021-11-09T16:00:00+01:00
Location: IT4Innovations

9 November 2021

IT4Innovations

Europe/Prague timezone

Support

pr@it4i.cz

PeptideCS: An extensive dataset mapping the potential energy surface of mono- and dipeptides

9 Nov 2021, 14:00

30m

Online (IT4Innovations)

Poster (Poster session)

Erik Andris (IOCB Prague)

Recent successes of deep-learning-based approaches to the protein folding problem¹,² have relied heavily on the existence of large protein structural³ and genomic⁴ databases. However, because these approaches cannot predict structures exactly, and sufficiently precise DFT-based methods are computationally too expensive, the final structure refinement step still relies on empirically constructed force fields. Replacing these force fields with more accurate machine-learned DFT-based force fields, which would combine the speed of empirical force fields with the accuracy of modern DFT methods, requires suitable training data. Datasets of DFT-based molecular properties of small molecules were first available for equilibrium geometries (QM7, QM9 and related datasets)⁵ and were later extended to off-equilibrium geometries (e.g. ANI-1),⁶ which occur frequently in proteins. However, the molecules in these datasets are usually quite small and thus not representative of protein structures. We therefore reasoned that an accurate energetic description of protein folding might benefit from a more specialized dataset, and we created the Peptide Conformational Samples dataset (PeptideCS), in which we sampled conformations with different dihedral angles in the main chain and side chains of -NH-Me and -CO-NHAc capped amino acids and dipeptides. Similar to others,⁷ we used the DFTB-based GFN2-xTB method⁸ to optimize appropriately constrained geometries, and then calculated energies, energy gradients and atomic charges at the BP86⁹,¹⁰/DZVP-DFT¹¹ level with the COSMO¹² solvation model (water) and, additionally, energies in water, 1-octanol, N,N-dimethylformamide, and n-hexane with the COSMO-RS¹³ solvation model. Our dataset consists of over 400 million non-equilibrium structures, uniformly sampled in all relevant dihedral angles.
We also ran optimizations at the GFN2-xTB level to obtain minima on the potential energy surface, resulting in an additional 100 million structures. The resulting dataset should thus extensively cover all possible arrangements in these simple peptide building blocks and can be used in the development and validation of protein force fields.
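
As an illustration of what uniform sampling in dihedral angles can look like, the sketch below enumerates every combination of angles on a regular grid. It is not the authors' code: the function name `dihedral_grid`, the 30° step, and the choice of three dihedrals are all assumptions made for the example, since the abstract does not state the grid resolution or the sampling procedure that was actually used.

```python
import itertools


def dihedral_grid(n_dihedrals, step_deg=30):
    """Enumerate all combinations of dihedral angles on a uniform grid.

    step_deg is a hypothetical spacing chosen for illustration; the
    abstract does not give the resolution used for PeptideCS.
    """
    if step_deg <= 0 or 360 % step_deg != 0:
        raise ValueError("step_deg must be a positive divisor of 360")
    # Angles cover [-180, 180); -180 and +180 describe the same geometry,
    # so only one of the two endpoints is included.
    angles = range(-180, 180, step_deg)
    return itertools.product(angles, repeat=n_dihedrals)


# Example: two backbone dihedrals (phi, psi) plus one side-chain dihedral.
grid = list(dihedral_grid(n_dihedrals=3, step_deg=30))
print(len(grid))   # number of constrained geometries to optimize
print(grid[0])     # first grid point, one angle per dihedral
```

The number of grid points is the number of angles per dihedral raised to the number of dihedrals, so the count grows quickly for residues with long side chains. In a workflow like the one described above, each grid point would correspond to one set of dihedral constraints for a geometry optimization.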

References

(1) Science 2021, 373, 871–876.
(2) Nature 2021, 596, 583–589.
(3) Nucleic Acids Res. 2019, 47, D520–D528.
(4) Nucleic Acids Res. 2005, 33, 154–159.
(5) Sci. Data 2014, 1, 1–7.
(6) Sci. Data 2017, 4, 1–8.
(7) J. Chem. Inf. Model. 2020, 60, 6135–6146.
(8) J. Chem. Theory Comput. 2019, 15, 1652–1671.
(9) Phys. Rev. A 1988, 38, 3098–3100.
(10) Phys. Rev. B 1986, 33, 8822–8824.
(11) J. Chem. Theory Comput. 2017, 13, 3575–3585.
(12) J. Phys. Chem. 1995, 99, 2224–2235.
(13) J. Phys. Chem. A 1998, 102, 5074–5085.

Primary authors: Erik Andris (IOCB Prague), Tadeáš Kalvoda (IOCB Prague), Lubomir Rulisek
