Description
Recent successes of deep-learning-based approaches to the protein folding problem (1,2) have relied heavily on the existence of large protein structural (3) and genomic (4) databases. However, because these approaches cannot predict structures exactly, and sufficiently precise DFT-based methods are computationally too expensive, the final structure refinement step still relies on empirically constructed force fields. Replacing these with more accurate machine-learned DFT-based force fields, which would combine the speed of empirical force fields with the accuracy of modern DFT methods, requires suitable training data. Datasets of DFT-based molecular properties of small molecules were first available for equilibrium geometries (QM7, QM9, and related datasets) (5) and were later extended to off-equilibrium geometries (e.g. ANI-1) (6), which occur commonly in proteins. However, the molecules in these datasets are usually quite small and thus not representative of protein structures. We therefore believed that an accurate energetic description of protein folding might benefit from a more specialized dataset, and we created the Peptide Conformational Samples dataset (PeptideCS), in which we sampled conformations with different dihedral angles in the main chain and side chains of -NH-Me and -CO-NHAc capped amino acids and dipeptides. Similar to others (7), we used the DFT-B-based GFN2-xTB method (8) to optimize appropriately constrained geometries, and then calculated energies, energy gradients, and atomic charges at the BP86 (9,10)/DZVP-DFT (11) level with the COSMO (12) solvation model (water) and, additionally, energies in water, 1-octanol, N,N-dimethylformamide, and n-hexane with the COSMO-RS (13) solvation model. Our dataset consists of over 400 million non-equilibrium structures, uniformly sampled in all relevant dihedral angles.
We also ran optimizations at the GFN2-xTB level to obtain minima on the potential energy surface, resulting in an additional 100 million structures. The resulting dataset should thus extensively cover all possible arrangements of these simple peptide building blocks and can be used in the development and validation of protein force fields.
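As an illustration of what uniform sampling in dihedral angles can look like, the following minimal Python sketch enumerates every combination of angles on a regular grid. It is not the procedure used to build PeptideCS: the function name `dihedral_grid`, the 30 degree step, and the choice of three dihedrals are all hypothetical and serve only to show how the number of structures grows with the number of sampled angles.

```python
import itertools


def dihedral_grid(n_dihedrals, step_deg):
    """Return an iterator over all angle combinations on a uniform grid.

    Each dihedral takes the values -180, -180 + step, ... up to (but not
    including) 180 degrees, so the periodic end point is not counted twice.
    Both arguments are illustrative assumptions, not dataset parameters.
    """
    if 360 % step_deg != 0:
        raise ValueError("step_deg must divide 360 evenly")
    angles = range(-180, 180, step_deg)
    return itertools.product(angles, repeat=n_dihedrals)


# Hypothetical case: two main-chain dihedrals plus one side-chain dihedral,
# sampled every 30 degrees (12 values per dihedral).
grid = list(dihedral_grid(n_dihedrals=3, step_deg=30))
print(len(grid))  # 12 ** 3 = 1728 constrained starting geometries
```

Because the grid size is the number of values per angle raised to the number of dihedrals, adding side-chain dihedrals or refining the step quickly multiplies the number of structures, which is consistent with a dataset of this kind reaching hundreds of millions of entries.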
References
(1) Science 2021, 373, 871–876.
(2) Nature 2021, 596, 583–589.
(3) Nucleic Acids Res. 2019, 47, D520–D528.
(4) Nucleic Acids Res. 2005, 33, 154–159.
(5) Sci. Data 2014, 1, 1–7.
(6) Sci. Data 2017, 4, 1–8.
(7) J. Chem. Inf. Model. 2020, 60, 6135–6146.
(8) J. Chem. Theory Comput. 2019, 15, 1652–1671.
(9) Phys. Rev. A 1988, 38, 3098–3100.
(10) Phys. Rev. B 1986, 33, 8822–8824.
(11) J. Chem. Theory Comput. 2017, 13, 3575–3585.
(12) J. Phys. Chem. 1995, 99, 2224–2235.
(13) J. Phys. Chem. A 1998, 102, 5074–5085.