
Friday, June 27, 2025

Synthetic Datasets & Model Distillation

Say you're given an all-knowing oracle machine that you can query for ground-truth data, but this oracle is fairly expensive¹ to query. Furthermore, say you're not really interested in all the ground truths the oracle has to offer; you just want it for a specific set of inquiries. What would you do? The most direct thing would be to carefully curate the set of inquiries that are most valuable, spend your resources getting the answers, and use those to teach a sage system. This idea is not new and is a form of model compression/distillation [1] and the teacher-student paradigm [2].

Oracle 🧙‍♂️ > Sage 🧑‍🏫

If our oracle is an all-knowing system, then a sage system is one with equally deep wisdom and knowledge in specific area(s) but inferior in all other areas. Our oracle is all-knowing but slow (think of an old professor 🧙‍♂️), while our sage is more youthful but less experienced (i.e., an assistant professor 🧑‍🏫).

For curating the inquiries, one could use a Bayesian optimization approach that iteratively selects the most valuable inquiries to send to the oracle, based on an acquisition function the user chooses. The acquisition function can favor exploration or exploitation: do you want to find a broad set of valuable inquiries, or home in on a specific set? There are also more traditional statistical sampling methods. I won't dwell on this since it's not the focus of this post, but keep it in mind.
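To make the curation idea concrete, here is a minimal sketch (my own illustration, not from any of the referenced papers) of acquisition-driven query selection: a cheap Gaussian-process surrogate picks the next inquiry with the largest predictive uncertainty (pure exploration), and only that inquiry is sent to a hypothetical `expensive_oracle` stand-in.

```python
# Minimal sketch: iteratively pick the "most valuable" inquiries to send to an
# expensive oracle, using a Gaussian-process surrogate and an exploration-style
# acquisition function (predictive uncertainty). `expensive_oracle` is a
# hypothetical stand-in for the slow, costly ground-truth system.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
expensive_oracle = lambda x: np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)  # stand-in

candidate_pool = np.linspace(0.0, 2.0, 500).reshape(-1, 1)   # possible inquiries
X = candidate_pool[[0, -1]]                                   # seed with the endpoints
y = expensive_oracle(X)[:, 0]

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-3, normalize_y=True)
budget = 10  # how many oracle calls we can afford

for _ in range(budget):
    gp.fit(X, y)
    _, std = gp.predict(candidate_pool, return_std=True)
    nxt = candidate_pool[[np.argmax(std)]]          # exploration: largest uncertainty
    X = np.vstack([X, nxt])
    y = np.append(y, expensive_oracle(nxt)[:, 0])

# (X, y) is now a small, carefully curated dataset labeled by the oracle.
```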

This is the idea behind synthetic dataset generation: the creation of inquiries and answers (i.e., labels) to build a dataset that is derived from the oracle/teacher but is not a direct copy of its corpus of knowledge/truths. It is a form of both data augmentation and data compression.

Once you have the set of inquiries and their answers, you can faithfully teach a sage system. This is what is called model distillation. Note this isn't model compression in the strict sense, since we didn't take the underlying knowledge structure of the oracle and modify it; we just extracted stochastic answers from the oracle. Typically the distilled model is more efficient and computationally cheaper to both train and run inference with.
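Here is a minimal sketch of that oracle-to-sage handoff, assuming PyTorch and using a large random network as a hypothetical stand-in for the oracle; the sage is a much smaller network trained only on the oracle's answers to the curated inquiries.

```python
# Minimal distillation sketch (PyTorch): a small "sage" network is trained only on
# inputs labeled by a large, expensive "oracle". Here `oracle` is a hypothetical
# frozen stand-in; in practice it would be something like a fine-tuned foundation model.
import torch
import torch.nn as nn

torch.manual_seed(0)

oracle = nn.Sequential(nn.Linear(8, 512), nn.Tanh(), nn.Linear(512, 1)).eval()  # big & slow (stand-in)
sage = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))             # small & fast

inquiries = torch.randn(2048, 8)                 # curated set of inputs
with torch.no_grad():
    answers = oracle(inquiries)                  # synthetic labels from the oracle

opt = torch.optim.Adam(sage.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(sage(inquiries), answers)     # teach the sage to mimic the oracle
    loss.backward()
    opt.step()
```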

In the wild

The recent paper by Gardner et al. [3] provides a nice demonstration of synthetic dataset generation and model distillation for creating fast, accurate, chemical-system-specific interatomic ML potentials. Their work offers a general, architecture-agnostic protocol for distilling knowledge from large, expensive "foundation models" (our Oracle 🧙‍♂️) into smaller, more efficient "student" models (our Sage 🧑‍🏫).

The process is essentially a three-step workflow:

```mermaid
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#4f46e5','primaryTextColor':'#ffffff','primaryBorderColor':'#3730a3','lineColor':'#6b7280','sectionBkgColor':'#f8fafc','altSectionBkgColor':'#e2e8f0','gridColor':'#e5e7eb','tertiaryColor':'#f1f5f9'}}}%%
graph LR;
  subgraph S1 ["Step 1: Fine-tuning"]
    A[/"Small DFT dataset"/] --> C;
    B["Foundation Model"] --> C{"Fine-tune"};
    C --> D(("Oracle"));
  end
  subgraph S2 ["Step 2: Synthetic Data"]
    D --> E{"Generate & Label"};
    E --> F[/"Large Synthetic Dataset"/];
  end
  subgraph S3 ["Step 3: Distillation"]
    F --> G{"Train Student"};
    H["Fast Architecture"] --> G;
    G --> I(("Sage"));
  end
```

  1. Fine-tuning the Teacher: They start with a pre-trained foundation model (like MACE-MP-0). These models are powerful in that they support chemical systems of essentially any composition (atomic numbers 1-89 or so). The downside is that the architecture is slow to evaluate and therefore not suitable for large-scale atomistic simulations (e.g., hundreds of thousands of atoms). As part of Gardner et al.'s protocol, they fine-tune the teacher/oracle² model on a very small set of high-quality, domain-specific calculations (e.g., ~25 DFT calculations for liquid water). This turns³ the general model into a specialized "teacher" for that specific system, without the high cost of creating a large DFT dataset from scratch, while still remaining general to all other chemical systems.
  2. Generating Synthetic Data: For the liquid water system, the fine-tuned teacher is used to generate a large dataset of ~1,000 atomic configurations, whose energies and forces are then labeled with that same teacher. For sampling, they employ a fairly efficient "rattle-and-relax" scheme that constructs a "family tree" of configurations. At each step, a "parent" structure is selected from the tree, a new "child" structure is generated from it, and the child is added back to the tree (see Figure 1, and the code sketch that follows it). This is much faster than running molecular dynamics simulations to explore the potential energy surface. The scheme works by taking a starting structure and iteratively:

    • Rattling: Displace the atomic positions \( \mathbf{R} \) and unit cell \( \mathbf{C}_0 \) by applying random perturbations:
      • \( \mathbf{R}' \leftarrow [(\mathbf{A} + \mathbf{I}) \times \mathbf{R}] + \mathbf{B} \)
      • \( \mathbf{C}' \leftarrow (\mathbf{A} + \mathbf{I}) \times \mathbf{C}_0 \)
    • Relaxing: Nudge the atoms in the direction of the forces predicted by the teacher model to find a new, stable configuration. At each relaxation step \( x \), the atomic positions \( \mathbf{R} \) are updated according to:

      \[ \mathbf{R}' \leftarrow \mathbf{R} + \frac{\sigma_B}{x} \cdot \frac{\mathbf{F}}{||\mathbf{F}||} \]

      where \( \mathbf{F} \) represents the forces predicted by the teacher model, and \( \sigma_B \) is a hyperparameter controlling the rattle intensity that is also used to scale the relaxation step size. This iterative process — the Robbins-Monro algorithm — allows for an efficient exploration of the local energy landscape to generate new structures.

  3. Distilling the Student (Sage): Finally, they train a much smaller, computationally cheaper model, the "student" (our Sage 🧑‍🏫), on this large synthetic dataset. Because the student model architecture (e.g., PaiNN or ACE) is simpler and has fewer parameters, both training and inference are much faster.

Figure 1. Synthetic data generation process (Fig. 6 from Gardner et al.).
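Below is a rough numpy sketch of my reading of the rattle-and-relax scheme (not the authors' code); `teacher_forces` is a hypothetical placeholder for the fine-tuned teacher, and \( \mathbf{A} \), \( \mathbf{B} \), and \( \sigma_B \) follow the notation above.

```python
# Rough sketch of the "rattle-and-relax" scheme described above (my reading of
# the paper, not the authors' code). `teacher_forces` is a hypothetical
# placeholder for the fine-tuned teacher/oracle model.
import numpy as np

rng = np.random.default_rng(0)

def teacher_forces(positions, cell):
    """Placeholder: in practice, forces from the fine-tuned foundation model."""
    return -positions  # toy restoring force so the sketch runs

def rattle_and_relax(R0, C0, sigma_B=0.1, sigma_A=0.02, n_relax=5):
    # Rattle: random strain-like matrix A and random displacements B.
    A = sigma_A * rng.normal(size=(3, 3))
    B = sigma_B * rng.normal(size=R0.shape)
    R = R0 @ (A + np.eye(3)).T + B      # R' <- [(A + I) x R] + B
    C = (A + np.eye(3)) @ C0            # C' <- (A + I) x C0

    # Relax: nudge atoms along the teacher's forces with a 1/x step size.
    for x in range(1, n_relax + 1):
        F = teacher_forces(R, C)
        R = R + (sigma_B / x) * F / np.linalg.norm(F)
    return R, C

parent_R = rng.normal(size=(12, 3))     # toy "parent" configuration
parent_C = 5.0 * np.eye(3)
child_R, child_C = rattle_and_relax(parent_R, parent_C)
# child_R / child_C would be labeled by the teacher and added back to the family tree.
```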

As a proof-of-concept for liquid water, their results suggest that you can get massive speed-ups with only a minor hit to accuracy.


Results from distilling a fine-tuned MACE-MP-0 foundation model for liquid water:

| Model | Type | Relative Speed | Force MAE (meV/Å) |
|---|---|---|---|
| MACE-MP-0b3 | Teacher (fine-tuned, Oracle 🧙‍♂️) | 1x | 32 |
| TensorNet | Student (Sage 🧑‍🏫) | > 10x | 37 |
| PaiNN | Student (Sage 🧑‍🏫) | > 30x | 39 |
| ACE | Student (Sage 🧑‍🏫) | > 100x | 51 |

The student/sage models (PaiNN, TensorNet, and ACE) are more computationally efficient for a couple of key reasons (see Figure 2). The first is lower model complexity, i.e., fewer parameters and layers. The second is that the computational cost of these models scales steeply with the interaction cut-off radius \( r \), often as \( \mathcal{O}(r^3) \). The big foundation models need a large radius (e.g., 6 Å) to stay general and capture many-body interactions. The sage/student models, however, can get away with a smaller radius (e.g., 4.5 Å for PaiNN) without losing much accuracy for the specific system they're trained on. This hyperparameter adjustment, combined with the simpler architectures, is how the authors get >10x to >100x speedups. The students are still "good" because they have learned the essential physics for that one system from the huge synthetic dataset provided by the all-knowing (but slow) teacher.
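As a back-of-envelope illustration of the cut-off argument (my own estimate, not a number from the paper), the work per atom grows roughly with the volume of the cut-off sphere, so shrinking the radius from 6 Å to 4.5 Å reduces that term by about

\[ \left(\frac{4.5\,\text{Å}}{6\,\text{Å}}\right)^{3} \approx 0.42, \]

i.e., a bit over a 2x saving from the cut-off alone; the rest of the >10x to >100x speedup comes from the smaller, simpler architectures themselves.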

Figure 2. Computational efficiency of distilled models (Fig. 2 from Gardner et al.).

Outlook

Synthetic dataset generation plus distillation isn't just a numerical trick; the resulting sage models can produce physically meaningful simulations. For liquid water, Gardner et al. report that the distilled models reproduce the structural properties of the much more expensive teacher model well. The distilled PaiNN model, for example, almost perfectly reproduces the radial distribution function (RDF).
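For readers who want to make this kind of comparison themselves, here is a minimal RDF sketch (my own, assuming an orthorhombic periodic box and the minimum-image convention); the `frame` array is a random placeholder where, e.g., the oxygen positions from a teacher or student trajectory would go.

```python
# Minimal RDF sketch for an orthorhombic periodic box (minimum-image convention),
# illustrating the kind of teacher-vs-student structural comparison discussed above.
# `frame` and `box` are hypothetical placeholders for a single simulation frame.
import numpy as np

def radial_distribution(positions, box, r_max=6.0, n_bins=120):
    n = len(positions)
    d = positions[:, None, :] - positions[None, :, :]        # all pairwise vectors
    d -= box * np.round(d / box)                              # minimum-image wrap
    r = np.linalg.norm(d, axis=-1)[np.triu_indices(n, k=1)]   # unique pair distances
    hist, edges = np.histogram(r, bins=n_bins, range=(0.0, r_max))
    rho = n / np.prod(box)                                     # number density
    shell = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    ideal = rho * shell * n / 2.0                               # ideal-gas pair counts
    return 0.5 * (edges[1:] + edges[:-1]), hist / ideal

box = np.array([12.4, 12.4, 12.4])                              # Angstrom
frame = np.random.default_rng(0).uniform(0, box, size=(64, 3))  # placeholder positions
r, g_r = radial_distribution(frame, box)
# In real use, average g_r over many frames from the teacher and student trajectories.
```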

Other structural and dynamical properties, such as ring-size distributions and tetrahedral order parameters, which describe the medium-range ordering and local geometry, show that the PaiNN student behaves similarly to the teacher. The ACE model, which is very fast, leads to a slightly more ordered, crystal-like water structure, but still captures the key chemical/material physics.
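For context (this is the standard Errington-Debenedetti definition, not something specific to this paper), the tetrahedral order parameter for a molecule \( i \) and its four nearest neighbors is typically

\[ q_i = 1 - \frac{3}{8} \sum_{j=1}^{3} \sum_{k=j+1}^{4} \left( \cos \psi_{jk} + \frac{1}{3} \right)^{2}, \]

where \( \psi_{jk} \) is the angle subtended at the central oxygen by neighbors \( j \) and \( k \); \( q_i = 1 \) for a perfect tetrahedron and averages to zero for an ideal gas.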

Broader Applicability

The Gardner et al. paper [3] demonstrates the versatility of the distillation method. Demonstrations on other systems include:

  • Metallic Hydrogen: Distilled models appear to reproduce the DFT equation of state for hydrogen at extreme temperatures and pressures.
  • Porous Amorphous Silica: The distilled models correctly capture the structure factor of porous amorphous silica.
  • Hybrid Perovskites: The distilled models replicate the rotational dynamics of the molecular cations inside the perovskite framework.
  • Organic Reactions: Distilled models successfully reproduce the intended SN2 reaction mechanism, although long-timescale simulations led to unphysical products.

Prospects

This fairly broad range of applications showcases that the distillation protocol is a powerful and general tool for creating more computationally efficient, specialized potentials for higher-throughput research. I'm bullish on this approach and think it is one direction atomistic modeling and simulation is headed when explicit electronic effects aren't needed; although, who knows, these models might get good enough to predict charge transfer and polarization effects.

Footnotes


  1. Expensive here can mean the time it takes to get an answer or the cost in terms of resources.

  2. The all-knowing oracle in this fine-tuning process would really be the DFT calculations, although DFT doesn't truly fit the notion of a black box since we know the physics being solved; in contrast, with GNN potentials we only sort of know what embeddings they are learning. So in the protocol of Gardner et al., what we have is the all-knowing DFT oracle and a semi-oracle (the fine-tuned model) that then teaches the sage.

  3. Fine-tuning is a technique in machine learning where a pre-trained model is adapted to a specific task by updating its parameters on a smaller dataset, often to improve performance on that task. While fine-tuning the teacher helps the eventual student significantly, I feel the better approach is for the community to create larger, more accurate foundation models and just use those as a true oracle.

References

[1] G. Hinton, O. Vinyals, J. Dean, Distilling the Knowledge in a Neural Network, (2015). DOI.

[2] C. Buciluǎ, R. Caruana, A. Niculescu-Mizil, Model compression, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Philadelphia PA USA, 2006: pp. 535–541. DOI.

[3] J.L.A. Gardner, et al., Distillation of atomistic foundation models across architectures and chemical domains, (2025). DOI.


