
Saturday, September 13, 2025

Conformal Prediction

An area that has gained my attention lately is quantifying how confident a model is in its prediction. There are various techniques for doing this; the familiar Bayesian methods are among the most popular, while others are still new to me. The one I've been reading about this weekend is conformal prediction [1]. So what is conformal prediction? Conformal prediction is a method for uncertainty quantification where, instead of giving a single prediction, we provide a set (e.g., $\{y_1, y_2, \ldots, y_n\}$) of plausible predicted values with a guarantee about how frequently the true value will be in that set. To put it even more simply, the model spits out a bunch of possible answers for a given input and we just ask, “Is the real answer somewhere in this pile, and if so, are you 95% sure?”.

This idea is more aligned with frequentist thinking: we don't concern ourselves with a prior or the full posterior distribution, but only ask how often the true value lands in the prediction set.

For regression, the prediction interval is a range within which we expect the predicted quantity to fall. Say we are predicting a bandgap: we might specify 3.0 to 3.5 eV and attach a confidence level that the true bandgap will lie in that range. If we are assigning category labels, the prediction set might be that the crystal structure is rocksalt, zincblende, or perovskite, and again the confidence that the true label is in that set is given some value (e.g., 90%).

Important Distinction

These are prediction intervals, not confidence intervals. Prediction intervals quantify uncertainty about individual predictions, while confidence intervals quantify uncertainty about distribution parameters like the mean.

Another aspect to keep in mind is that conformal prediction uses calibration residuals from a held-out set to size the prediction interval. This differs from Bayesian or bootstrap methods, which model the data-generating process (e.g., via likelihoods) to derive uncertainty.

Let's put the conformal prediction idea into math. First we define $P(\cdot)$ as the probability of some event occurring (e.g., $P(E_b=3.1\,\text{eV})$), then we define the true value as $E_b^{n+1}$, where $n+1$ indexes the new input. Next we define the prediction interval as $\Gamma(X_{n+1})$, where $X_{n+1}$ is the new input; this is a set-valued function that, given an input, returns a set of possible values. For regression, this is typically a continuous interval like $[3.1, 3.5]$ eV for the bandgap prediction. Then we have to define our confidence level, i.e., quantify how confident we want to be that the true value is in the set. To do this we use $1-\alpha$, where $\alpha$ is the miscoverage (error) level. If $\alpha = 0$ we are saying that with 100% confidence the true value will be in the prediction interval. If $\alpha = 0.05$ we are saying that with 95% confidence the true value will be in the prediction interval, and so on. Now we can write the probability statement as:

$$ \begin{equation} P(E_b^{n+1} \in \Gamma(X_{n+1})) \geq 1 - \alpha. \label{eq:conformal_prediction} \end{equation} $$

Reading through this: the probability that the true value $E_b^{n+1}$ (i.e., the true bandgap at the new input $X_{n+1}$) falls in the prediction set $\Gamma(X_{n+1})$ is at least $1 - \alpha$.

The thing to note is that this is a guarantee statement: if we fix the confidence level and construct the prediction set as prescribed, then we can make a statement about how often the true value lands in the set. We never need to directly compute the probability in $\eqref{eq:conformal_prediction}$ under an assumed data distribution, which is why this method is called "distribution-free".

We also need a way to measure how "bad" the model's guess was compared to the truth, that is, some loss-like quantity called a nonconformity score. There are several ways to define the nonconformity score depending on the type of prediction we are making. I'll just show the absolute residual (L1 norm) as an example for regression, which would be

$$ \begin{equation} s_i = \vert E_{b}^{i} - \hat{E}_{b}^{i} \vert \nonumber \end{equation} $$

where $\hat{E}_b^{i}$ is the model's prediction for the bandgap given input $X_{i}$, $E_b^{i}$ is the true value, and $s_i$ is the nonconformity score for the $i$-th prediction. For example, a nonconformity score of 0.5 eV means the prediction missed the true value by 0.5 eV. This is standard in regression. Importantly, the nonconformity scores are computed on a calibration set, that is, for $i \in \{1, 2, \ldots, m\}$, where this set is held out and not used for training.
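As a tiny numerical sketch (the bandgap values below are made up, not from any real model), the calibration scores are just absolute residuals on the held-out set:

import numpy as np

# Hypothetical calibration data: true and predicted bandgaps (eV)
E_true = np.array([3.10, 2.95, 3.40, 3.20, 3.05])
E_pred = np.array([3.00, 3.10, 3.30, 3.45, 3.00])

scores = np.abs(E_true - E_pred)  # L1 nonconformity scores s_i
print(scores)                     # approximately [0.10, 0.15, 0.10, 0.25, 0.05]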

So how do we use this in practice?

Switching from bandgap to forces, consider a GNN trained to predict atomic forces from atomic structures. We hold out a calibration set and compute nonconformity scores $s_i = |F_i^{true} - \hat{F}_i|$ for each force component, where $F_i^{true}$ is the true force and $\hat{F}_i$ is the model prediction.

We now build the prediction interval. Choose a desired coverage level, say $1-\alpha = 0.9$. Sort the $m$ calibration scores $s_{(1)} \le \cdots \le s_{(m)}$ and take $r = \lceil (m+1)(1-\alpha) \rceil$. The threshold is the $r$-th smallest score, $\hat q = s_{(r)}$. This is the standard split-conformal finite-sample rule.
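A minimal sketch of that rule, using stand-in calibration scores (the exponential draw below is just a placeholder for real residuals):

import numpy as np

# Stand-in calibration scores (eV/Å); real ones come from the held-out set
rng = np.random.default_rng(0)
scores = np.sort(rng.exponential(scale=0.05, size=200))

alpha = 0.1                              # target coverage 1 - alpha = 0.9
m = len(scores)
r = int(np.ceil((m + 1) * (1 - alpha)))  # r = ceil((m+1)(1-alpha)) = 181 here
q_hat = scores[r - 1]                    # threshold: the r-th smallest score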

Now we take a new atomic structure $X_{n+1}$ and construct its prediction interval:

$$ \begin{equation} \Gamma(X_{n+1}) = [\hat{F}^{n+1} - \hat q,\; \hat{F}^{n+1} + \hat q] \label{eq:conformal_prediction_updated} \end{equation} $$

With this prediction interval (with $\hat q$ chosen as above) we can then make a statement about how likely the true value is to fall inside $\eqref{eq:conformal_prediction_updated}$.

Again, these are pretty simple concepts to follow, and the point is that the theory guarantees the prediction interval/set will contain the true value with the desired confidence level. This makes it simpler than Bayesian methods, where you have to "try" to compute the full posterior distribution through likelihood and prior probability distributions, all of which require reasonably good choices.

To further illustrate, let me work through a simple example of computing conformal prediction intervals for forces using the orb-v3 pre-trained models on structures from the load-atoms GST dataset. The basic steps are: define the calibration and test sets, compute the nonconformity scores, build the prediction interval/set, and then make a statement about how likely the true value is to be in the interval. Below is a code snippet that shows the conformal prediction bands for the forces alongside the model predictions.

Code

# /// script
# requires-python = ">=3.8"
# dependencies = [
#   "numpy",
#   "matplotlib",
#   "scikit-learn",
#   "orb-models",
#   "load-atoms",
# ]
# ///
import math
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from matplotlib.patches import Patch
from sklearn.neural_network import MLPRegressor
from load_atoms import load_dataset
from orb_models.forcefield import pretrained
from orb_models.forcefield.calculator import ORBCalculator

data = load_dataset("GST-GAP-22")
calibration, tests = data.random_split([500, 100], seed=42)

orb = pretrained.orb_v3_direct_20_omat(device="cpu")
calc = ORBCalculator(orb, device="cpu")


def pred_forces(atoms):
    atoms.calc = calc
    return atoms.get_forces()


# Locally adaptive split-conformal prediction using normalized residuals
# Note: These are prediction intervals, not confidence intervals
calibration_true = np.concatenate([a.arrays["forces"].ravel() for a in calibration])
calibration_pred = np.concatenate([pred_forces(a).ravel() for a in calibration])
scores = np.abs(calibration_true - calibration_pred)  # L1 nonconformity scores
z_calibration = np.abs(calibration_pred)  # feature: force magnitude


def fit_scale_model(X, y, alpha=0.001):
    """Residual-scale model for locally adaptive (normalized) conformal prediction.

    Learns the relationship between force magnitude and typical residual size;
    the predicted scale is used to normalize the nonconformity scores and
    produce input-dependent prediction intervals. Note: `alpha` here is the
    MLP L2 regularization strength, not the conformal miscoverage level.
    """
    model = MLPRegressor(
        hidden_layer_sizes=(100, 50, 25),
        activation="relu",
        solver="adam",
        alpha=alpha,
        learning_rate="adaptive",
        learning_rate_init=0.001,
        max_iter=1000,
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=20,
        random_state=42,
    )
    model.fit(X, y)
    return model


# Fit the adaptive scale model and compute normalized scores
scale_model = fit_scale_model(z_calibration.reshape(-1, 1), scores, alpha=0.01)
sigma_calibration = np.clip(scale_model.predict(z_calibration.reshape(-1, 1)), 1e-8, None)
scores = np.sort(scores / sigma_calibration)  # sorted normalized nonconformity scores

test_true = np.concatenate([a.arrays["forces"].flatten() for a in tests])
test_pred = np.concatenate([pred_forces(a).flatten() for a in tests])

z_test = np.abs(test_pred)
sigma_test = np.clip(scale_model.predict(z_test.reshape(-1, 1)), 1e-8, None)

levels = {
    "2σ": math.erf(2 / np.sqrt(2)),
    "3σ": math.erf(3 / np.sqrt(2)),
}

# ---- Plotting ----
lo, hi = float(min(test_true.min(), test_pred.min())), float(
    max(test_true.max(), test_pred.max())
)
ax_lim = max(abs(lo), abs(hi))
lo, hi = -ax_lim, ax_lim
idx = np.argsort(test_true)
xs, ys = test_true[idx], test_pred[idx]

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True)
hexbin_for_colorbar = None
for ax, (tag, p) in zip(axes, levels.items()):
    delta_n = scores[
        min(max(int(np.ceil((len(scores) + 1) * p)) - 1, 0), len(scores) - 1)
    ]
    halfw = delta_n * sigma_test
    hw_s = halfw[idx]

    inside_band = np.abs(test_pred - test_true) <= halfw
    ax.plot([lo, hi], [lo, hi], "k-", lw=0.2)
    inside_band_sorted = np.abs(ys - xs) <= hw_s

    main_mask = (xs >= -1.5) & (xs <= 1.5) & (ys >= -1.5) & (ys <= 1.5)
    main_xs = xs[main_mask]
    main_ys = ys[main_mask]
    main_hw_s = hw_s[main_mask]
    main_inside = inside_band_sorted[main_mask]

    # Plot adaptive prediction interval boundaries
    ax.plot(main_xs, main_xs - main_hw_s, "b-", lw=0.2, alpha=0.6, zorder=2)
    ax.plot(main_xs, main_xs + main_hw_s, "b-", lw=0.2, alpha=0.6, zorder=2)

    confident_xs = main_xs[main_inside]
    confident_ys = main_ys[main_inside]
    if len(confident_xs) > 0:
        hb = ax.hexbin(
            confident_xs, confident_ys, gridsize=25, cmap="Greens", alpha=0.7, mincnt=1
        )

    uncertain_xs = main_xs[~main_inside]
    uncertain_ys = main_ys[~main_inside]
    if len(uncertain_xs) > 0:
        ax.scatter(
            uncertain_xs,
            uncertain_ys,
            s=1.5,
            alpha=0.5,
            color="red",
            marker="x",
            zorder=1,
        )

    coverage = np.mean(inside_band) * 100
    ax.set_title(
        f"{tag}: Test Coverage = {coverage:.1f}%", fontsize=14, fontweight="bold"
    )
    ax.set_xlabel("True Forces (eV/Å)", fontsize=12)
    if ax is axes[0]:
        ax.set_ylabel("Predicted Forces (eV/Å)", fontsize=12)
    ax.tick_params(labelsize=11)
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_aspect("equal")

    inset = inset_axes(ax, width="35%", height="35%", loc="upper left")
    inset.plot([lo, hi], [lo, hi], "k-", lw=0.2)
    inset.plot(xs, xs - hw_s, "b-", lw=0.15, alpha=0.6, zorder=2)
    inset.plot(xs, xs + hw_s, "b-", lw=0.15, alpha=0.6, zorder=2)

    confident_xs_full = xs[inside_band_sorted]
    confident_ys_full = ys[inside_band_sorted]
    if len(confident_xs_full) > 0:
        hb_inset = inset.hexbin(
            confident_xs_full,
            confident_ys_full,
            gridsize=20,
            cmap="Greens",
            alpha=0.7,
            mincnt=1,
        )
        hexbin_for_colorbar = hb_inset

    uncertain_xs_full = xs[~inside_band_sorted]
    uncertain_ys_full = ys[~inside_band_sorted]
    if len(uncertain_xs_full) > 0:
        inset.scatter(
            uncertain_xs_full,
            uncertain_ys_full,
            s=1.0,
            alpha=0.5,
            color="red",
            marker="x",
            zorder=1,
        )

    inset.set_xlim(lo, hi)
    inset.set_ylim(lo, hi)
    inset.set_aspect("equal")
    inset.tick_params(labelsize=8)

plt.suptitle(
    "Adaptive Conformal Prediction for Force Prediction Intervals",
    fontsize=16,
    fontweight="bold",
)

legend_elements = [
    Patch(facecolor="red", label="Model Uncertain"),
    Patch(facecolor="blue", label="Model Adaptive PI"),
]

fig.legend(
    handles=legend_elements,
    loc="center",
    bbox_to_anchor=(0.35, 0.15),
    ncol=1,
    fontsize=10,
    framealpha=0.9,
)

if hexbin_for_colorbar is not None:
    cbar = fig.colorbar(
        hexbin_for_colorbar,
        ax=axes,
        orientation="vertical",
        shrink=0.6,
        aspect=10,
        pad=0.15,
    )
    cbar.set_label("Point Density", fontsize=9, labelpad=-60)
    cbar.ax.tick_params(labelsize=8, labelleft=False, labelright=True)
    cbar.ax.set_position([0.52, 0.15, 0.02, 0.7])

    ticks = cbar.get_ticks()
    formatted_labels = []
    for tick in ticks:
        if tick >= 1000:
            formatted_labels.append(f"{tick/1000:.0f}K")
        elif tick >= 100:
            formatted_labels.append(f"{tick/1000:.1f}K")
        else:
            formatted_labels.append(f"{tick:.0f}")
    cbar.set_ticklabels(formatted_labels)

plt.tight_layout()
plt.subplots_adjust(bottom=0.1, wspace=0.30, top=0.85)
plt.savefig("force_parity_adaptive.png", dpi=600, bbox_inches="tight")

Figure 1. Force Parity plot for test data including adaptive conformal prediction bands.

The results for the 2$\sigma$ and 3$\sigma$ prediction interval levels are shown in the figure above for 500 calibration structures and 100 test structures. I use a locally adaptive split-conformal method: fit a function $\sigma(|\hat F|)$ on the calibration set to model the residual scale, compute normalized scores $s_i/\sigma_i$, select the threshold $\hat q$ on these normalized scores as above, and then set the half-width $h(X)=\hat q\,\sigma(|\hat F|)$. This provides heteroscedastic, input-dependent intervals.

The adaptive conformal prediction bands reveal details about model performance. The achieved coverage (95.3% for 2$\sigma$, 99.6% for 3$\sigma$) closely matches the theoretical expectations (95.45% and 99.73% respectively), so we can say the model is well-calibrated. Points outside the prediction bands represent cases where the model's uncertainty estimate was insufficient, i.e., these are the "hard" cases where the model struggles to make the correct predictions. While 3$\sigma$ bands provide higher confidence (99.6% vs 95.3%), they are wider and less informative. The ideal model would provide narrow bands with high confidence, indicating both precision and reliability.

Notice in the inset that the bands are much wider in the large-force regime. This suggests the model may be unable to properly capture regions where the potential energy surface exhibits large gradients.

Beyond Distribution-Free Guarantees

Exact split-conformal validity holds when the score function (and any normalization) is fixed before seeing the calibration labels. Here the scale function $\sigma(\cdot)$ is fit on the calibration set itself, which can break exchangeability of the scores and thus the finite-sample guarantee. The trade-off often yields tighter, more efficient intervals, but coverage should be validated empirically. Using a separate proper-training set to learn $\sigma(\cdot)$ and reserving calibration solely for ranking scores restores the classical guarantee.

There is an inherent tension in uncertainty quantification between confidence and precision. Higher sigma values (3$\sigma$) provide greater confidence that the true value lies within the band, but at the cost of wider, less informative intervals. Conversely, lower sigma values (2$\sigma$) provide narrower, more precise intervals but with lower confidence guarantees.

The other aspect is that we can assess the honesty of the model by looking at the test coverage, i.e., how often the true value is in the prediction interval. In Figure 1 we see the test coverage is close to the target confidence level for both 2$\sigma$ and 3$\sigma$.

I think this was an interesting approach to model uncertainty that is simple in a lot of ways. So what are the downsides? The main one is that it requires a fair amount of calibration and test data to compute reliable prediction intervals; with a small dataset, you may not be able to estimate the prediction interval/set well. Another issue is that conformal prediction gives you marginal coverage, not conditional coverage, meaning the guarantees hold on average over your whole dataset but not uniformly for every input, so you might have a model that's well-calibrated overall but terrible for specific material classes or rare chemistries. The method also assumes your data is i.i.d. or exchangeable, so if you have time dependence, selection bias, or covariate shift, the guarantees can break down completely. The efficiency depends heavily on your choice of nonconformity score, so if you pick a poor one you'll get conservative, overly wide prediction intervals. It also doesn't give you per-point probabilities, just sets with frequency guarantees, so if you need likelihoods you'll have to look elsewhere. Distribution shift can inflate your residual quantiles, giving you uninformative bands, and for structured targets like energy+forces+stress, joint coverage can be quite conservative.

Where does one go from here?

The prediction bands provide a natural criterion for active learning: points falling outside the bands, or with very wide bands, represent high-uncertainty cases that would benefit from additional training data. The idea is to use conformal prediction as the uncertainty quantification for an ML model and, within an active learning framework, decide how to query the most informative samples. With conformal prediction you would propose a batch of unlabeled inputs, pass them through the model, and use the prediction intervals to gauge the degree of confidence (i.e., uncertainty) in each prediction. Then, based on your sampling strategy, you would query an oracle to label the chosen data point(s), retrain or fine-tune the model with the new data, update the calibration set, and repeat the process.
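A rough sketch of one such query step, reusing the adaptive split-conformal machinery from the force example above (here `model` and `scale_model` are assumed to be already-fitted scikit-learn-style regressors; this is illustrative, not any specific library's API):

import numpy as np

def conformal_query(model, scale_model, X_cal, y_cal, X_pool, alpha=0.1, k=10):
    """Rank unlabeled pool points by their adaptive conformal interval width."""
    # Normalized nonconformity scores on the calibration set
    sigma_cal = np.clip(scale_model.predict(X_cal), 1e-8, None)
    scores = np.sort(np.abs(y_cal - model.predict(X_cal)) / sigma_cal)

    # Split-conformal threshold on the normalized scores
    m = len(scores)
    r = int(np.ceil((m + 1) * (1 - alpha)))
    q_hat = scores[min(r, m) - 1]

    # Interval half-width for every pool point; the widest are the most uncertain
    half_width = q_hat * np.clip(scale_model.predict(X_pool), 1e-8, None)
    return np.argsort(half_width)[-k:]  # indices to send to the oracle (e.g., DFT)

The returned indices would then be labeled by the oracle, appended to the training data, and the calibration set refreshed before the next cycle.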

Uncertainty quantification isn't about being right, it's about being right with the right amount of confidence. A model that's always right but with huge prediction bands is less useful than one that's right 95% of the time with narrow, informative bands. Our results demonstrate a well-calibrated model that appropriately increases uncertainty for challenging cases, which is exactly what you want in a reliable uncertainty quantification system.


References

[1] A.N. Angelopoulos, S. Bates, A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, (2022). https://doi.org/10.48550/arXiv.2107.07511.




Saturday, August 30, 2025

Down memory lane: Quantum Computing

I spent about 2.5 years working on variational quantum algorithms for noisy intermediate-scale quantum (NISQ) devices [1]. The question was straightforward: can we do anything useful with these noisy, small-scale, "universal" quantum devices? Here, useful typically meant faster, but it could also mean solving problems far too complex for classical quantum chemistry.

The short answer I came to was: not in any way that clearly demonstrated a speedup over well-tuned classical solvers. The longer answer: you can get them to run, get numbers back, even match the literature, but there's no clear, reproducible speed-up. Most of the effort is in engineering the system to produce anything coherent, not in pushing computational frontiers. I'm not sure if this is a good thing or a bad thing.

So what is the state now? Are there any clear signs of utility for variational quantum algorithms? Are there any new quantum algorithms for chemistry or physics that require error correction but have proven advantage over classical computers? My guess is the answer is no, not really, but I'm not sure and for now don't have the time to read up and investigate.

What I saw with Universal Quantum Computing in the NISQ Era

Universal quantum computing here means we have a qubit, the quantum-information analog of a digital bit, and we can apply arbitrary single-qubit rotations and controlled two-qubit gates to produce logic operations that put qubits into superposition and entangle them. We can also compose them into circuits that approximate any unitary evolution.

For famous algorithms like Shor's or Grover's, the path to usefulness is clear if you had fault-tolerant quantum computing with sufficient qubits. But in the NISQ setting, VQA [2] or QAOA are the only viable options. There might be some new class of NISQ-like algorithms, I'm not sure, but it's probably still safe to say VQA and QAOA are dominant.

Let's say you have moved beyond the NISQ era, which may well be happening, so that noise and decoherence are under control. There's still the ansatz problem: how do you pick a circuit structure that can efficiently represent the solution state in the first place?

The Ansatz Problem

An ansatz is a parameterized quantum-circuit guess for the form of your quantum state. In VQAs, it's the fixed sequence of gates you tune with a classical optimizer. In phase estimation (PEA) or Hamiltonian simulation, it's often the state-preparation step for your quantum algorithm.
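To make "parameterized circuit" concrete, here is a generic hardware-efficient ansatz sketched in Qiskit (purely illustrative; the layer count and gate pattern are arbitrary choices, not the circuit from any work cited here):

from qiskit.circuit import QuantumCircuit, Parameter

n_qubits, n_layers = 4, 2
qc = QuantumCircuit(n_qubits)

# Alternating layers of tunable single-qubit rotations and a CNOT ladder;
# a classical optimizer adjusts the theta parameters.
for layer in range(n_layers):
    for q in range(n_qubits):
        qc.ry(Parameter(f"theta_{layer}_{q}"), q)
    for q in range(n_qubits - 1):
        qc.cx(q, q + 1)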

The difficulty is balancing expressivity and feasibility:

  • Too shallow: can't represent the physics; optimizer converges to the wrong state.
  • Too deep: hardware noise kills you in NISQ; in fault-tolerant hardware, depth inflates T-gate and qubit costs.
  • Too generic: risks barren plateaus [3].
  • Too problem-specific: works only on one Hamiltonian.

In PEA, the ansatz problem just shifts to the state-preparation step. You might nail the controlled-unitary and inverse QFT, but if you can't efficiently prepare the eigenstate, you'll likely get garbage.

So what did I work on?

Most of the creativity in the research I did came from my colleague/co-author; I was mostly involved in domain application, instrumentation, and analysis. There were two works [4-5], but the one I'll highlight is probably the least interesting: we developed a hybrid quantum–classical eigensolver without variation or parametric gates [4]. The idea is to project the problem Hamiltonian into a smaller subspace, measure it term-by-term with short circuits, and diagonalize it classically.

This allowed us to:

  • Extract ground and excited states for small molecules (BeH₂, LiH).
  • Validate against exact diagonalization.
  • Run on the quantum hardware at the time (i.e., IBM devices).

It avoided long, problem-tailored ansatz circuits, but the choice of basis in the subspace projection is still a hidden ansatz.
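To make the classical half of that workflow concrete, here is a toy sketch where a random Hermitian matrix stands in for the Hamiltonian and the projected matrix elements are computed directly with NumPy; in the actual work [4] those elements would be estimated term-by-term on hardware rather than computed this way:

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
dim = 16                           # e.g., a 4-qubit Hamiltonian
A = rng.normal(size=(dim, dim))
H = (A + A.T) / 2                  # Hermitian stand-in Hamiltonian

# Reduced-space "ansatz": a handful of normalized (non-orthogonal) trial states
basis = rng.normal(size=(dim, 4))
basis /= np.linalg.norm(basis, axis=0)

H_sub = basis.T @ H @ basis        # projected Hamiltonian
S_sub = basis.T @ basis            # overlap matrix

# Generalized eigenvalue problem H_sub c = E S_sub c, solved classically
energies, _ = eigh(H_sub, S_sub)
print("subspace ground-state estimate:", energies[0])
print("exact ground state:            ", np.linalg.eigvalsh(H)[0])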

A Self-Critique of Our Hybrid Eigensolver Work

We avoided variational loops and deep, problem-specific ansätze: no parameterized circuits, no barren-plateau optimizers. The issues, though, were:

  1. We didn't really dodge the ansatz problem: the reduced-space basis is still an ansatz, so performance depends on making a smart choice. We didn't quantify sensitivity.
  2. Hardware vs. simulation gap unexplored: IBM runs matched noiseless simulations, but was this due to noise resilience, shallow circuits, or luck?
  3. Thin classical comparisons: we used exact diagonalization due to the simplicity of the chemical system and basis. Real claims require benchmarks vs. DMRG [6], coupled-cluster, etc.

Why NISQ VQAs Struggle

There has been a lot of work on this in the last few years and I'm not fully up to date, but this is what I gather:

  • Noise vs. depth: deeper means more decoherence.
  • Barren plateaus: gradients vanish exponentially with qubits.
  • Optimizer instability: hardware drift, shot noise, optimizer quirks.
  • Classical competition: tensor networks, DMRG [6] often scale better [7].
  • Ansatz rigidity: wrong ansatz wastes all gates and shots.

Summary of My Position

I'm not a seasoned quantum algorithm researcher, but from my limited research, I see NISQ-era VQAs as mainly useful for benchmarking, with their progress limited by the challenge of designing effective ansätze. In quantum chemistry, there is little convincing evidence so far that VQAs offer practical advantages1. Digital quantum simulation is highly flexible but comes with significant resource costs. Analog simulation, which directly emulates physical systems, is already useful in certain specialized areas but will always be a niche. Looking ahead to fault-tolerant quantum computing, real breakthroughs may be possible, but efficient state preparation will remain a central obstacle.

Footnotes


  1. Maybe this has changed but I wager probably not and the paper by Lee et al. [7] is a good indicator. 

References

[1] J. Preskill, Quantum Computing in the NISQ era and beyond, arXiv (2018).
[2] M. Cerezo et al., Variational Quantum Algorithms, arXiv (2021).
[3] M. Larocca et al., Barren Plateaus in Variational Quantum Computing, arXiv (2024).
[4] P. Jouzdani & S. Bringuier, Hybrid Quantum–Classical Eigensolver Without Variation or Parametric Gates, Quantum Reports 3, 8 (2021). DOI
[5] P. Jouzdani, S. Bringuier, M. Kostuk, A method of determining molecular excited-states using quantum computation, MRS Advances 6 (2021) 558–563. DOI
[6] S. R. White, Density matrix formulation for quantum renormalization groups, PRL 69, 2863 (1992).
[7] S. Lee et al., Evaluating the evidence for exponential quantum advantage in ground-state quantum chemistry, Nat. Commun. 14, 1952 (2023). DOI



Saturday, July 12, 2025

Semi-Empirical Methods get a boost 🚀

Density Functional Theory (DFT) is a mainstay for those studying molecules or materials in silico. It works well, is performant, and advances in XC functionals will probably make it even more useful and powerful. However, there have for some time been what are called semi-empirical methods, which are not as accurate but are much faster and can provide a good initial idea of the electronic structure of a material. They are called semi-empirical because the inner workings are parameterized rather than derived from first principles. This, in essence, makes them fast, but with the downside that they aren't as transferable to other materials or structures, and their formalism may restrict the description of certain electronic effects.

Tight-binding is one such method. It is based on the LCAO approximation where atomic orbitals remain localized on their parent atoms, and molecular/crystal orbitals are constructed as linear combinations of these atomic orbitals. The method uses a minimal atomic orbital basis set and is straightforward to implement for describing electronic structure of molecules and materials. While accuracy is typically lower than DFT, it can be quite effective for certain material classes.

Why is TB a semi-empirical method? Because it has parameters that are fit to data, specifically the hopping integrals between atomic orbitals. These integrals describe electronic interactions between atoms and are determined by fitting to experimental data or higher-level calculations. The hopping integrals are used to construct the Hamiltonian matrix, which is then diagonalized to obtain electronic structure through the secular equation.
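As a toy illustration of that workflow (entirely my own sketch, unrelated to g-xTB): a one-orbital chain with onsite energy eps and nearest-neighbor hopping t is the simplest tight-binding model, and diagonalizing its Hamiltonian reproduces the analytic cosine band:

import numpy as np

n_sites, eps, t = 100, 0.0, -1.0    # hypothetical parameters (eV)

H = np.zeros((n_sites, n_sites))
np.fill_diagonal(H, eps)            # onsite energies
for i in range(n_sites - 1):
    H[i, i + 1] = H[i + 1, i] = t   # hopping integrals
H[0, -1] = H[-1, 0] = t             # periodic boundary conditions

# Diagonalizing the Hamiltonian (solving the secular equation) gives the band energies
eigenvalues = np.linalg.eigvalsh(H)

# Compare with the analytic dispersion eps + 2 t cos(k)
k = 2 * np.pi * np.arange(n_sites) / n_sites
analytic = np.sort(eps + 2 * t * np.cos(k))
print(np.allclose(eigenvalues, analytic))  # True (up to numerical precision)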

Since atomic orbitals are localized on atomic positions, tight-binding works well for molecules, clusters, and materials where electrons form localized chemical bonds, such as semiconductors. However, it is less effective for metals where the minimal atomic orbital basis inadequately describes metallic bonding1 or where strong electron-correlation effects are important.

I've personally never used tight-binding methods extensively, but they can be very useful for the right material or chemical system. However, I believe their traditional limitations are beginning to change, allowing for broader applicability.

A General-Purpose Extended Tight-Binding Calculator

The team behind g-xTB [1] is the Stefan Grimme Lab, which is well-known and deeply respected in computational chemistry. Their motivation for g-xTB is driven by the goal to close the cost-accuracy gap between semi-empirical tight-binding and hybrid-DFT methods without sacrificing speed. The older GFN2-xTB approaches tried to address this by adding dispersion and hydrogen-bonding corrections onto a minimal basis. Ultimately, however, GFN2-xTB aligned more closely with GGA-DFT, lacked Fock exchange, and retained the limitations inherent in rigid atomic orbitals. The idea behind g-xTB is to overcome these limitations and cover most of the periodic table, aiming for DFT-level accuracy at tight-binding speed, enabling calculations involving reaction thermochemistry, excited states, and more.

g-xTB Advancements

Some key areas in which g-xTB advances tight-binding theory are:

  1. Increased flexibility due to an atom-in-molecule adaptive q-vSZP basis.
  2. A refined Hamiltonian incorporating range-separated Fock exchange.
  3. An explicit first-order term and extensions up to the fourth order in the charge-fluctuation series.
  4. Charge-dependent Pauli-penetration repulsion.
  5. Streamlined atomic-correction potentials that address basis shortcomings.

Importantly, every element from H to Lr is covered by a single parameter set trained on 32,000 diverse data points, including "mindless" molecules. This significantly enhances usability.

Outcomes

It appears that g-xTB achieves DFT-level accuracy at tight-binding speed. Across GMTKN55, g-xTB reduces GFN2-xTB’s WTMAD-2 from 25.0 to 9.3 kcal mol⁻¹, matching low-cost hybrid composites while running only about 40% slower than GFN2-xTB and approximately 2,400 times faster than B3LYP-D4 on a 2,000-atom complex—at least according to my understanding of the paper's metrics. The authors also highlight that reaction barriers, spin-state gaps, transition-metal thermochemistry, and large biomolecular zwitterions are now tractable without SCF failures. This gives users a robust drop-in replacement for GFNn-xTB and, in many screening and dynamics workflows, a viable substitute for mid-tier DFT calculations.

Usage

The authors have released a Linux binary, which is usable if you set up your input files and parameter paths correctly.

To simplify things, I decided to create an ASE wrapper [2] for the g-xTB calculator. This is a fairly common and straightforward task in ASE. Existing calculator wrappers for other tight-binding codes might work if environment variables are adjusted correctly, but I preferred to write the file parsers from scratch (it goes quickly these days 😉).

Since the wrapper won't be regularly maintained and given that the g-xTB binary will eventually be implemented in a more mainstream TB framework, I decided to keep it as a git repo you can clone and install:

python -m venv .venv
source .venv/bin/activate
git clone --recursive https://github.com/stefanbringuier/g-xtb-ase
pip install g-xtb-ase/

Warning

Direct pip installation via git+... isn't supported because pip doesn't handle git submodules. You must clone with --recursive to get the required g-xTB binary and parameter files.

To import the calculator object:

from gxtb_ase import GxTB
...
atoms.calc = GxTB(charge=0, spin=0)

What can you do with it?

You can use it like any other ASE calculator to get total energies and forces (stress not yet supported):

from ase.build import molecule
from gxtb_ase import GxTB

atoms = molecule("H2O")
atoms.calc = GxTB(charge=0)
energy = atoms.get_potential_energy()
forces = atoms.get_forces()

This is useful for geometry optimizations or molecular dynamics (MD). For example, running 3 water molecules in MD at 25 °C for 10 ps with g-xTB takes about 45 minutes on a single CPU. However, I'm unsure how accurate structural details (e.g., RDF or structure factors) compare to other methods.
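As a sketch of what such a run can look like with the wrapper (a single water molecule here, using standard ASE MD machinery; the timestep, friction, and step count are just illustrative):

from ase import units
from ase.build import molecule
from ase.md.langevin import Langevin
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from gxtb_ase import GxTB

atoms = molecule("H2O")
atoms.calc = GxTB(charge=0)

# NVT (Langevin) dynamics at roughly 25 °C
MaxwellBoltzmannDistribution(atoms, temperature_K=298)
dyn = Langevin(atoms, timestep=0.5 * units.fs, temperature_K=298, friction=0.002)
dyn.run(100)  # 50 fs; extend for production runs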

A nice feature of tight-binding methods is their ability to calculate atomic charges, allowing you to analyze charge transfer, bond orders, bond formation/breakage, and dipole moments in response to external fields.

The wrapper doesn't currently extract electronic structure details, but these are available in the log files.

Web App

I built a backend API using FastAPI and quickly assembled a front end. Compute resources are limited, so large molecules or heavy usage might cause crashes, but water or methane tests should work fine. IR and thermochemistry calculations are also supported.
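For flavor, a minimal endpoint along these lines could look like the sketch below; the route name, payload schema, and use of the ASE wrapper are my assumptions for illustration, not the actual backend code:

from io import StringIO

from ase.io import read
from fastapi import FastAPI
from pydantic import BaseModel

from gxtb_ase import GxTB

app = FastAPI()

class XYZRequest(BaseModel):
    xyz: str        # molecule in XYZ format
    charge: int = 0

@app.post("/energy")
def energy(req: XYZRequest):
    atoms = read(StringIO(req.xyz), format="xyz")
    atoms.calc = GxTB(charge=req.charge)
    return {"energy_eV": float(atoms.get_potential_energy())}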

Expect crashes & bugs

The API and front end are hosted on Render and Netlify, respectively, using their free tiers, limiting resources and uptime. The web app isn't fully unit-tested, so expect potential bugs or app crashes.

Here is an example using it to get the optimized (limited steps) geometry for a simple linear molecule, hydrogen cyanide, and then using that to get the IR spectrum and thermochemistry.

Figure 1. Hydrogen cyanide structure, IR, and thermochemistry using g-xTB web app

The agreement with the IR spectrum and thermochemistry is overall pretty good, although the heat capacity is way off, which could be a conversion or bug issue in the web app. As of now I think g-xTB can only be used for molecules, since I don't see how the input format lets you specify PBC and a cell, but maybe I'm missing something.

Footnotes


  1. See the jellium model or uniform electron gas, where the electrons behave like a fluid/gas; those models are useful for understanding the effect of placing a positive charge in them to create a metal. 

References

[1] T. Froitzheim, M. Müller, A. Hansen, S. Grimme, g-xTB: A General-Purpose Extended Tight-Binding Electronic Structure Method For the Elements H to Lr (Z=1–103), (2025). https://doi.org/10.26434/chemrxiv-2025-bjxvt.

[2] S Bringuier, ASE Wrapper for g-xTB, 2025. https://github.com/stefanbringuier/g-xtb-ase


