I came across a very useful package that makes it easy to load atomic datasets containing coordinates, energies, and forces for use in fitting interatomic potentials. It's called Load Atoms; it has a very simple syntax and seems to be fast. It also caches the datasets locally, so you don't re-download them every time. Getting started is easy:
pip install load-atoms
from load_atoms import load_dataset
dataset = load_dataset("QM9")
Loading the dataset prints an info card:
╭───────────────────────────────── QM9 ─────────────────────────────────╮
│                                                                       │
│  Downloading dsgdb9nsd.xyz.tar.bz2 ━━━━━━━━━━━━━━━━━━━━ 100% 00:09    │
│  Extracting dsgdb9nsd.xyz.tar.bz2  ━━━━━━━━━━━━━━━━━━━━ 100% 00:18    │
│  Processing files                  ━━━━━━━━━━━━━━━━━━━━ 100% 00:19    │
│  Caching to disk                   ━━━━━━━━━━━━━━━━━━━━ 100% 00:02    │
│                                                                       │
│            The QM9 dataset is covered by the CC0 license.             │
│        Please cite the QM9 dataset if you use it in your work.        │
│          For more information about the QM9 dataset, visit:           │
│                            load-atoms/QM9                             │
╰───────────────────────────────────────────────────────────────────────╯
You can then begin using the dataset for any machine learning work. There are also utility methods to prepare and split the dataset. Overall, it's a very convenient tool, and I hope the number of available datasets grows.
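To make the idea of splitting concrete, here is a minimal sketch of a reproducible random train/validation/test split in plain Python. This is not the Load Atoms API: the `random_split` helper below is hypothetical, written only for illustration, and the strings stand in for real structures. Check the package documentation for its actual splitting methods.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def random_split(items: Sequence[T], fractions: Sequence[float], seed: int = 42) -> List[List[T]]:
    """Hypothetical helper: shuffle reproducibly, then cut into consecutive parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    n = len(items)
    indices = list(range(n))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed
    random.Random(seed).shuffle(indices)
    parts: List[List[T]] = []
    start = 0
    for i, frac in enumerate(fractions):
        if i == len(fractions) - 1:
            end = n  # the last part takes the remainder so nothing is dropped
        else:
            end = min(n, start + int(round(frac * n)))
        parts.append([items[j] for j in indices[start:end]])
        start = end
    return parts


# Placeholder "structures"; in practice these would be the loaded structures
structures = [f"structure_{i}" for i in range(100)]
train_set, val_set, test_set = random_split(structures, [0.8, 0.1, 0.1])
print(len(train_set), len(val_set), len(test_set))
```

Fixing the seed matters when fitting potentials: it lets you compare models on exactly the same held-out structures.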
Side Treat: Visualization
The other thing I really like is the visualization utility built into the package. It uses X3DOM via the ASE io HTML write format, which I didn't even know existed! You can use it to display your structures in a Jupyter notebook. The Load Atoms viewer produces X3DOM HTML visualizations that look a little better than the default ASE output. Here is a code snippet and the resulting view from Load Atoms:
from ase.build import graphene_nanoribbon
from load_atoms import view
nanoribbon = graphene_nanoribbon(3, 4,
                                 type='armchair',
                                 saturated=True,
                                 vacuum=3.5)
nanoribbon.rotate(v='z', a=90, rotate_cell=True)
view(nanoribbon, show_bonds=True)
Available Datasets
It has a number of the datasets that have been used to test different ML and NN models in chemistry, e.g., QM9, but it's missing large datasets like the Materials Project or AFLOW. My guess is it will get there, but there might be license or size issues (🤷‍♂️). Here is the list of what is available:
Dataset | Elements | # Atoms | # Structures | License | Year |
---|---|---|---|---|---|
AC-2D-22 | C | 30,000 | 150 | CC BY-NC-SA 4.0 | 2022 |
C-GAP-17 | C | 284,965 | 4,530 | CC BY-NC-SA 4.0 | 2017 |
C-GAP-20U | C | 400,275 | 6,088 | GPLv3 | 2020 |
C-SYNTH-23M | C | 23,041,200 | 115,206 | MIT | 2022 |
GST-GAP-22 | Ge, Sb, Te | 341,132 | 2,692 | CC BY 4.0 | 2022 |
P-GAP-20 | P | 140,910 | 4,798 | CC BY 4.0 | 2020 |
QM7 | H, C, N, O, S | 110,650 | 7,165 | None | 2012 |
QM9 | H, C, N, O, F | 2,407,753 | 133,885 | CC0 | 2014 |
Si-GAP-18 | Si | 171,815 | 2,475 | CC BY-NC-SA 4.0 | 2018 |
SiO2-GAP-22 | O, Si | 268,118 | 3,074 | CC BY 4.0 | 2022 |
As you can see, these datasets are fairly limited in terms of the species and structures they include. It's a good start, though.
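To put rough numbers on "limited", the short sketch below computes the average number of atoms per structure from the atom and structure counts in the table above. The counts are copied from the table; the dictionary and variable names are just for illustration.

```python
# (number of atoms, number of structures), copied from the table above
datasets = {
    "AC-2D-22": (30_000, 150),
    "C-GAP-17": (284_965, 4_530),
    "C-GAP-20U": (400_275, 6_088),
    "C-SYNTH-23M": (23_041_200, 115_206),
    "GST-GAP-22": (341_132, 2_692),
    "P-GAP-20": (140_910, 4_798),
    "QM7": (110_650, 7_165),
    "QM9": (2_407_753, 133_885),
    "Si-GAP-18": (171_815, 2_475),
    "SiO2-GAP-22": (268_118, 3_074),
}

# Average structure size for each dataset
avg_atoms = {name: n_atoms / n_structs for name, (n_atoms, n_structs) in datasets.items()}

for name, avg in sorted(avg_atoms.items(), key=lambda kv: kv[1]):
    print(f"{name:<12} {avg:8.1f} atoms per structure")

total_structures = sum(n_structs for _, n_structs in datasets.values())
print(f"Total structures across all datasets: {total_structures:,}")
```

The QM7 and QM9 entries are small molecules (under 20 atoms per structure on average), while the two carbon datasets AC-2D-22 and C-SYNTH-23M average exactly 200 atoms per structure.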
@misc{Bringuier_27FEB2024,
title = {Atomic Dataset Convenience Tool},
author = {Bringuier, Stefan},
year = 2024,
month = feb,
url = {https://www.diracs-student.blog/2024/02/}#
{atomic-dataset-python-convenience-pkg.html},
note = {Accessed: 2025-04-08},
howpublished = {Dirac's Student [Blog]},
}