|𝔻⟩irac's Student

Sunday, July 5, 2026

Intelligence Power per Dollar

I was thinking while using codex this weekend that not every token is as valuable as every other token. Many of the metrics you see relate to token generation rate or cost per token, but that is only a partial picture. When I use a model I care about how much useful intelligence output I get for how much it costs and how long it takes to achieve the desired utility. Benchmark accuracies, API prices, and tokens per second are almost always documented, yet none of them alone answers the question I actually want answered:

How much useful token intelligence power am I buying per dollar?

Below I am proposing a kind of FOM (Figure of Merit) that compares models on quality, speed, and cost together. I have no idea how useful this will be or how well my definitions for "token intelligence" will hold up. But I think it is a useful thought experiment. Again, the central idea is that not all tokens are equally valuable, which is indeed strongly tied to the model capabilities. In my view, the natural unit for combining value with time and money is what I would call intelligence action, by analogy with physical action and information-theoretic memory.

Why benchmarks, price, and speed are not enough

Three quantities dominate public comparisons:

Quantity	What it measures	What it misses
Benchmark score	Task accuracy on fixed evals	Dollar cost and latency
Cost per token	API price	Whether the token is any good
Throughput (token/s)	Generation speed	Quality of each token

A frontier model can score a few points higher yet cost an order of magnitude more. A cheap model can be fast yet unreliable on the tasks you care about. Leaderboards such as the Artificial Analysis Intelligence Index already compress many evals into one score, and several sites rank "value" as intelligence divided by price, including WhatLLM.org and Analytical Insider. Those are useful, but they still omit time: two models with the same intelligence-per-dollar can feel very different if one streams at 250 token/s and the other at 45 token/s.³

I want one scalar that rewards quality and thrift together, while keeping the pieces visible, including speed as an explicit tradeoff rather than something the score silently maximizes.

Naming by analogy

My take

I do not know if this is a good take/analogy but this is how I am naming/coining terms. There might already be a term for this but I do not know what it is.

In physics, action scales as energy × time. In information theory, information capacity over time scales as bit × second (I believe this is sometimes also called memory).

Language models output tokens, not joules or bits, although one could think about tokens in these terms. By the same pattern, intelligence action scales as token × second, and its time-rate is what I call intelligence power (token / second), the quantity the figure of merit below actually prices out. This is just nomenclature I chose; it is not a properly derived claim. The analogy is intentionally shallow: I'm just borrowing the product-of-content-and-time structure without importing the rest. The payoff is meaningful, readable units.

Token Intelligence Value Factor (TIVF)

I'm now going to describe what I call the Token Intelligence Value Factor (TIVF). It is the intelligence content of one generated token, expressed in reference-equivalent tokens. To keep units honest, I'll distinguish two kinds of token throughout: a raw-token is the physical unit you are actually billed for and that throughput $R$ and price $\tilde{c}_\mathrm{out}$ are measured in, while a ref-token is the intelligence-equivalent unit, i.e. one raw-token emitted by the reference model. So one raw-token from model $m$ is worth $\mathrm{TIVF}(m)$ ref-tokens from a fixed reference model $m_\mathrm{ref}$, and TIVF carries the unit ref-token/raw-token.

I'm going to define this as

$$ \begin{equation} \mathrm{TIVF}(m_\mathrm{ref}) \equiv 1\ \text{ref-token}/\text{raw-token}. \label{eq:tivf_def} \end{equation} $$

So TIVF is not dimensionless: it has the unit ref-token/raw-token,² but importantly it is a token whose "intelligence" has been rescaled. A model with $\mathrm{TIVF}=2$ produces tokens that count double toward intelligence action relative to a defined reference model.

Capability spectrum

This is a hard one to define because intelligence is not one number. A model can score well on knowledge yet fail on agentic terminal work, or ace GPQA yet stumble on SWE-bench Pro. To approximate a broad spectrum, I'll assign each model six public scores $S_k(m,e)$ on a 0--100 scale, where $k$ indexes the score axis and each score is measured at some thinking effort $e$, if supported:

Axis $k$	Benchmark	What it probes
Knowledge	MMLU-Pro [1]	Broad multi-subject knowledge
Science	GPQA Diamond [2]	Graduate science Q&A
Coding	SWE-bench Verified [3]	Real GitHub issue resolution
Hard coding	SWE-bench Pro	Harder multi-language software engineering
Reasoning	Humanity's Last Exam w/ tools	Multidisciplinary expert reasoning
Agentic	Terminal-Bench 2.x	Shell/terminal agent coordination

I am trying to mirror the spirit of composite indices such as the Artificial Analysis Intelligence Index, but keep each axis visible instead of collapsing them.

Token intelligence rescaling

For each score axis $k$, I'll normalize to the reference model $m_\mathrm{ref}$, which will be GPT-4o mini ¹, and apply a mild superlinear exponent $\gamma=1.2$. The thinking is that a superlinear exponent amplifies large score ratios, so when more capable models do especially well on the hardest benchmarks, those gains still show up even if many models already get top scores on the easier ones. What we get is something like

$$ \begin{equation} \mathrm{TIVF}_k(m,e) = \left(\frac{S_k(m,e)}{S_k(m_\mathrm{ref})}\right)^{\gamma}. \label{eq:tivf_domain} \end{equation} $$

$S$ is the score of the model $m$ on the axis $k$. The reference model anchors every axis: $\mathrm{TIVF}_k(m_\mathrm{ref})=1$ for all $k$. The score ratio $S_k(m,e)/S_k(m_\mathrm{ref})$ is itself dimensionless; the ref-token base scale is injected only through the normalization $\mathrm{TIVF}(m_\mathrm{ref})\equiv 1\ \text{ref-token}/\text{raw-token}$ from eq. \ref{eq:tivf_def}.
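As a quick sanity check of eq. \ref{eq:tivf_domain}, here is a minimal sketch of the per-axis rescaling. The function name `tivf_axis` is my own label, not something from a library. The reference scores are the GPT-4o mini values quoted in the footnotes, and the model scores are the DeepSeek V4 Flash [max] entries from the cross-vendor table further down.

```python
GAMMA = 1.2  # mild superlinear exponent used in the post


def tivf_axis(score: float, ref_score: float, gamma: float = GAMMA) -> float:
    """Per-axis rescaling (S_k / S_k_ref) ** gamma, both scores on a 0-100 scale."""
    if score <= 0 or ref_score <= 0:
        raise ValueError("scores must be positive")
    return (score / ref_score) ** gamma


# GPT-4o mini reference scores for two of the six axes (from the footnotes).
REF = {"SWE-Pro": 12.0, "HLE": 16.0}
# DeepSeek V4 Flash [max] scores for the same axes (from the cross-vendor table).
DEEPSEEK_MAX = {"SWE-Pro": 54.0, "HLE": 45.1}

for axis, ref_score in REF.items():
    print(f"{axis}: TIVF_k = {tivf_axis(DEEPSEEK_MAX[axis], ref_score):.2f}")
```

With these inputs the SWE-Pro axis comes out near 6.1 and the HLE axis near 3.5, so the low reference score on SWE-Pro is what drives the largest per-axis factors.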

I chose GPT-4o mini as the reference model because it seems to be the most-used model across stacks, although it might be that Anthropic models lead enterprise API usage.

To prevent a single strong axis from dominating or compensating for the weaker ones, we can take a weighted geometric mean of the axes:

$$ \begin{equation} \boxed{ \begin{aligned} \mathrm{TIVF}(m,e) &= \kappa(m)\cdot \exp\left(\sum_k w_k \ln \mathrm{TIVF}_k(m,e)\right) \\ &= \kappa(m)\cdot\prod_k \mathrm{TIVF}_k(m,e)^{\,w_k} \quad\big[\text{ref-token}/\text{raw-token}\big] \end{aligned} } \label{eq:tivf} \end{equation} $$

Here we use equal weights (i.e., $w_k=1/6$). One thing we need to account for is that a larger context window does not automatically mean "smarter" answers; rather, it determines the complexity/class of problems the model can attempt. The context modifier $\kappa(m)$ from eq. \ref{eq:context_modifier} tries to capture which tasks are "attemptable" by a model.

By construction $\mathrm{TIVF}(m_\mathrm{ref},e)\equiv 1$ ref-token/raw-token, because every ratio $S_k(m_\mathrm{ref})/S_k(m_\mathrm{ref})=1$ and $\kappa(m_\mathrm{ref})=1$. A model with $\mathrm{TIVF}=4$ produces tokens that count quadruple toward intelligence action relative to the reference across the full spectrum, not on a single leaderboard. Note that the geometric (not arithmetic) mean was deliberate on my part. A weak axis drags the product down rather than being averaged away by a strong one, so a hard-capability cliff (e.g. high knowledge but failing coding) is penalized rather than hidden. Any scalar can still mask a single-axis failure, which is why the per-axis $S_k$ are reported alongside TIVF.

$$ \begin{equation} \kappa(m) = 1 + \beta\cdot\ln\left(\frac{L_m}{L_\mathrm{ref}}\right), \qquad \beta = 0.087. \label{eq:context_modifier} \end{equation} $$

With $m_\mathrm{ref}$ corresponding to GPT-4o mini ($L_\mathrm{ref}=128\,\mathrm{k}$ tokens), a 1M-token model picks up roughly an 18% multiplicative bonus ($\beta=0.087$). The coefficient $\beta$ is small on purpose so context is an enabler, not a substitute for reasoning scores. Models with context windows below $L_\mathrm{ref}$ get $\kappa<1$, an intentional penalty reflecting the narrower class of tasks they can attempt. In truth, this is probably the framework's least defensible knob since context length is a capacity constraint, not intelligence. When a task fits comfortably inside every model's window, $\kappa$ rewards headroom that yields no real utility. This means we should treat $\kappa$ as a task-attemptability gate, not an intelligence premium.
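To make eq. \ref{eq:context_modifier} concrete, the sketch below evaluates $\kappa$ for a few window sizes. Only the 128k reference and the 1M case come from the text; the 32k and 200k windows are illustrative values I picked, and `context_modifier` is a hypothetical helper name.

```python
import math

BETA = 0.087     # context coefficient from the post
L_REF = 128_000  # GPT-4o mini context window in tokens


def context_modifier(context_tokens: int, l_ref: int = L_REF, beta: float = BETA) -> float:
    """kappa = 1 + beta * ln(L_m / L_ref)."""
    return 1.0 + beta * math.log(context_tokens / l_ref)


# 32k and 200k are illustrative windows, not rows from the post's tables.
for window in (32_000, 128_000, 200_000, 1_000_000):
    print(f"{window:>9,d} tokens -> kappa = {context_modifier(window):.3f}")
```

The 1M-token case lands at about 1.18, matching the 18% bonus quoted above, while a 32k window is penalized by about 12%.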

Task-specific workloads can replace the equal weights in eq. \ref{eq:tivf} with custom $w_k$ that still sum to one (e.g., overweight Hard coding for agentic coding) without changing the FOM details below. One other aspect is how "intelligence action" relates to "intelligence power". I keep both: eq. \ref{eq:intelligence_action} defines the action as a cognitive footprint (ref-token$\cdot$s per raw-token) that keeps TIVF visible as a token rescaling rather than collapsing into raw throughput, while its time-rate $\mathrm{TIVF}\cdot R$ is the intelligence power (ref-token/s) that the figure of merit below actually prices out.

Including Reasoning effort

Most models expose some form of effort levels (OpenAI reasoning_effort, Anthropic extended/adaptive thinking, DeepSeek thinking modes) that change three things at once: the spectrum scores $S_k(m,e)$, hidden reasoning tokens billed as output, and wall-clock throughput $R$ (OpenAI GPT-5.5, Anthropic Opus 4.8, DeepSeek V4 pricing, evals.report).

To account for this, I tag each row with an effort level $e\in\{\mathrm{none},\mathrm{high},\mathrm{max},\mathrm{fast}\}$ and use effort-specific benchmark scores when available. Hidden reasoning is modeled as an effective output-price multiplier:

$$ \begin{equation} \tilde{c}_\mathrm{out}(m,e) = c_\mathrm{out}(m)\cdot \Big[1 + \lambda\cdot\big(\psi(e)-1\big)\Big], \qquad \lambda = 0.65, \label{eq:effort_cost} \end{equation} $$

where $\psi(e)$ is a billed-token multiplier ($\psi=1$ for non-thinking, $\sim 2$ at default thinking, $\sim 5$ at max/xhigh effort). Throughput $R$ is taken directly from provider API measurements at the stated effort level (Artificial Analysis, June 2026 snapshot)⁵, not scaled again by a latency factor. Note $R$ here is a response output rate, which for reasoning models includes hidden reasoning tokens, so $R$ already absorbs the thinking slowdown; $\psi$ acts only on the billed price, keeping rate ($R$) and price ($\psi$) on separate factors so reasoning is not billed twice (this is an approximation, see limitations). Fast mode rows use Anthropic fast-tier pricing with throughput scaled from the standard Opus row ($\times 2.5$).

The idea is this is a proxy for compute spent per visible token, not a literal token accounting. It prevents comparing a cheap non-thinking call against a frontier model evaluated at max effort.
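A small sketch of eq. \ref{eq:effort_cost} shows how the multiplier behaves. Mapping the `high` tag to $\psi=2$ and `max` to $\psi=5$ is my reading of the "$\sim 2$ at default thinking" and "$\sim 5$ at max/xhigh" values, so treat it as an assumption; `effective_output_price` is a hypothetical helper.

```python
LAMBDA = 0.65
# Assumed mapping from effort tag to billed-token multiplier psi(e).
PSI = {"none": 1.0, "high": 2.0, "max": 5.0}


def effective_output_price(c_out: float, effort: str, lam: float = LAMBDA) -> float:
    """c_out * [1 + lambda * (psi(e) - 1)], in the same units as c_out."""
    return c_out * (1.0 + lam * (PSI[effort] - 1.0))


# DeepSeek V4 Flash list price from the post: $0.28 per million output tokens.
for effort in ("none", "high", "max"):
    price = effective_output_price(0.28, effort)
    print(f"{effort:>4}: ${price:.3f} / M")
```

Under these assumptions the multiplier is 1.0, 1.65, and 3.6 for the three effort levels, so the same list price becomes roughly \$0.28, \$0.46, and \$1.01 per million tokens.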

Figure of merit: intelligence power per dollar

Intelligence action is a kind of cognitive footprint of a generation: the TIVF a token carries, held over the wall-clock time $t$ it takes to produce,

$$ \begin{equation} \mathcal{A}_I = \mathrm{TIVF}\cdot t \quad \Big[\tfrac{\text{ref-token}\cdot\mathrm{s}}{\text{raw-token}}\Big]. \label{eq:intelligence_action} \end{equation} $$

In the framing of the principle of least action, intelligence action becomes a kind of latency weight: for a fixed TIVF you minimize $\mathcal{A}_I$ by driving the time $t$ down, so a model that delivers the same intelligence faster leaves a smaller sweep. What you actually use per second is the time-rate of that intelligence, a kind of intelligence power,

$$ \begin{equation} P_I = \mathrm{TIVF}\cdot R \quad \big[\text{ref-token}/\mathrm{s}\big], \label{eq:fom_deriv} \end{equation} $$

where $R$ is the measured output raw-token/s at effort $e$. Dividing by the effective cost rate $\tilde{c}_\mathrm{out}$ (USD per raw-token) gives the FOM that has units of intelligence power per dollar:

$$ \begin{equation} \boxed{ \mathrm{FOM}(m,e) = \frac{\mathrm{TIVF}(m,e)\cdot R(m,e)}{\tilde{c}_\mathrm{out}(m,e)} \quad\Big[\tfrac{\text{ref-token}\cdot\text{raw-token}}{\$\cdot\mathrm{s}}\Big]. } \label{eq:fom} \end{equation} $$

The numerator $\mathrm{TIVF}\cdot R$ is intelligence power (ref-token/s), but note that the denominator is a unit price (USD per raw-token), not a spend rate (USD/s). Dividing instead by the spend rate $\tilde{c}_\mathrm{out}\cdot R$ (USD/s) would cancel $R$ and leave the speed-blind value efficiency $\mathrm{TIVF}/\tilde{c}_\mathrm{out}$ (ref-token/\$). So FOM is by design value-per-dollar $\times$ throughput: $R$ stays up top to reward speed, and the raw-token in ref-token$\cdot$raw-token/(\$$\cdot$s) is the receipt. The effective price $\tilde{c}_\mathrm{out}$ reduces to raw output pricing when $\psi=1$. Input/cache pricing matters for RAG and agents, and a blended $\tilde{c}$ is easy to substitute in eq. \ref{eq:fom} if your workload is input-dominated. As a reminder, $m$ is the model and $e$ is the effort level.
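The difference between FOM and the speed-blind value efficiency is easy to see numerically. In this sketch the two models are hypothetical: they share the same TIVF and price and differ only in throughput, using the 45 and 250 token/s figures mentioned earlier in the post. The function names are mine.

```python
def fom(tivf: float, rate: float, price: float) -> float:
    """Eq. (fom): TIVF * R / c_out, in ref-token * raw-token / ($ * s)."""
    return tivf * rate / price


def value_efficiency(tivf: float, rate: float, price: float) -> float:
    """Intelligence power divided by spend rate (price * R); R cancels, leaving ref-token / $."""
    power = tivf * rate        # ref-token / s
    spend_rate = price * rate  # $ / s
    return power / spend_rate


# Hypothetical models: same TIVF and price (USD per raw-token), different speed.
slow = dict(tivf=3.0, rate=45.0, price=3.0e-6)
fast = dict(tivf=3.0, rate=250.0, price=3.0e-6)

print(f"value efficiency slow/fast: {value_efficiency(**slow):.3g} / {value_efficiency(**fast):.3g}")
print(f"FOM slow/fast             : {fom(**slow):.3g} / {fom(**fast):.3g}")
print(f"FOM ratio fast/slow       : {fom(**fast) / fom(**slow):.2f}")
```

Value efficiency is identical for both models, while FOM separates them by exactly the throughput ratio, which is the design choice described above.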

Reading the number

Larger FOM means more intelligence value per dollar, weighted by throughput for the stated effort! It is a value metric, not a capability one, so compare at matched effort: a budget model can win on FOM while losing on TIVF. Price dominates the spread ($c_\mathrm{out}$ spans $\sim$100$\times$ vs $R$ $\sim$10$\times$ and TIVF $\sim$5$\times$), so FOM is, to first order, a tokens-per-dollar ranking that TIVF only tilts.

Case study: Ballpark comparison (June 2026)

Note

Snapshot date: June 2026. Pricing and throughput numbers move quickly; treat the tables as ballpark illustrations, not a live leaderboard. Also, LLMs were used heavily here to generate the tables and plots. Some aspects of the pattern analysis were also done with LLMs.

The first table lists cross-vendor inputs at matched max/xhigh effort; the second lists the Claude Opus lineage. A third table shows derived TIVF and FOM from eqs. \ref{eq:tivf} and \ref{eq:fom}. $S$ is the arithmetic mean of the six axis scores (display only). Output pricing $c_\mathrm{out}$ is from official provider pages (June 2026); throughput $R$ is output token/s from Artificial Analysis provider API measurements at the stated effort⁵. Benchmark scores are from vendor cards, evals.report, BenchLM, MorphLLM, and Anthropic system cards. Coverage spans OpenAI, Anthropic, Google, DeepSeek, xAI (Grok), Moonshot (Kimi), and Alibaba (Qwen), plus one open-weight row on GroqCloud; Groq (inference host) is not Grok (xAI's model).

Input tables and derived TIVF / FOM

Cross-vendor

Model	Effort	S	SWE-Pro	HLE tools	c_out	R
DeepSeek V4 Flash	max	69.3	54.0%	45.1%	$0.28 / M	106
DeepSeek V4 Flash	high	67.2	52.6%	40.3%	$0.28 / M	95
DeepSeek V4 Flash	none	61.1	49.1%	28.0%	$0.28 / M	94
GPT-4o mini (ref.)	high	33.3	12.0%	16.0%	$0.60 / M	66
Gemini 2.5 Pro	high	71.4	54.2%	51.4%	$10.0 / M	141
Gemini 3 Pro	high	74.1	55.0%	52.0%	$12.0 / M	130
Gemini 2.5 Flash	high	60.5	45.0%	38.0%	$2.50 / M	225
Gemini 3 Flash	high	67.7	50.0%	43.5%	$3.00 / M	300
Gemini 3.1 Flash	high	69.0	52.0%	45.0%	$3.00 / M	450
Kimi K2.6	high	72.8	58.6%	54.0%	$4.0 / M	88
Qwen3 Max	high	69.2	50.0%	49.5%	$6.0 / M	72
Grok 4	high	65.5	48.0%	42.0%	$15.0 / M	78
GPT-5.5	max	76.0	58.6%	52.2%	$30.0 / M	53
GPT-5.5	high	72.1	55.0%	50.0%	$30.0 / M	56

Claude Opus lineage

Model	Effort	S	SWE-Pro	HLE tools	c_out	R
Claude Opus 4.5	high	67.8	45.9%	43.4%	$25.0 / M	42
Claude Opus 4.6	high	72.5	53.4%	53.1%	$25.0 / M	45
Claude Opus 4.7	high	77.0	64.3%	54.7%	$25.0 / M	45
Claude Opus 4.7	max	77.8	65.5%	55.5%	$25.0 / M	55
Claude Opus 4.8	high	79.2	69.2%	57.9%	$25.0 / M	58
Claude Opus 4.8	max	79.8	70.0%	58.5%	$25.0 / M	58
Claude Opus 4.8	fast	79.2	69.2%	57.9%	$50.0 / M	145

Derived TIVF and FOM

Model	Effort	TIVF [ref-token/raw-token]	FOM [ref-tok·raw-tok/($·s)]
DeepSeek V4 Flash	none	3.23	1.08 × 10⁹
DeepSeek V4 Flash	high	3.76	7.73 × 10⁸
DeepSeek V4 Flash	max	3.94	4.15 × 10⁸
Gemini 3.1 Flash	high	4.12	3.75 × 10⁸
Llama 3.3 70B (Groq)	high	1.18	3.58 × 10⁸
Gemini 3 Flash	high	3.81	2.31 × 10⁸
Gemini 2.5 Flash	high	3.28	1.79 × 10⁸
GPT-4o mini	high	1.00	6.67 × 10⁷
Kimi K2.6	high	3.85	5.13 × 10⁷
Gemini 2.5 Pro	high	4.15	3.55 × 10⁷
Claude Haiku 4.5	high	2.59	3.36 × 10⁷
Gemini 3 Pro	high	4.33	2.84 × 10⁷
Qwen3 Max	high	3.55	2.58 × 10⁷
Claude Opus 4.8	fast	4.77	1.38 × 10⁷
Grok 4	high	3.27	1.03 × 10⁷
Claude Sonnet 4.6	high	3.70	8.22 × 10⁶
Claude Opus 4.8	high	4.77	6.71 × 10⁶
Claude Opus 4.7	high	4.57	4.98 × 10⁶
GPT-5.5	high	4.19	4.74 × 10⁶
Claude Opus 4.6	high	4.21	4.59 × 10⁶
Claude Opus 4.5	high	3.34	3.40 × 10⁶
Claude Opus 4.8	max	4.82	3.11 × 10⁶
Claude Opus 4.7	max	4.64	2.83 × 10⁶
GPT-5.5	max	4.49	2.20 × 10⁶
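The derived FOM column can be spot-checked from the input tables. This is a rough consistency check, not the script that produced the table: it assumes $\psi=2$ for `high` and $\psi=5$ for `max` (my reading of eq. \ref{eq:effort_cost}), uses the rounded TIVF values as printed, and skips the `fast` row because its billing treatment is not spelled out.

```python
LAMBDA = 0.65
PSI = {"none": 1.0, "high": 2.0, "max": 5.0}  # assumed effort -> psi mapping

# (model, effort, TIVF, c_out in USD per million tokens, R in token/s, reported FOM)
ROWS = [
    ("DeepSeek V4 Flash", "none", 3.23, 0.28, 94, 1.08e9),
    ("DeepSeek V4 Flash", "high", 3.76, 0.28, 95, 7.73e8),
    ("DeepSeek V4 Flash", "max", 3.94, 0.28, 106, 4.15e8),
    ("GPT-4o mini", "high", 1.00, 0.60, 66, 6.67e7),
    ("Claude Opus 4.8", "high", 4.77, 25.0, 58, 6.71e6),
    ("Claude Opus 4.8", "max", 4.82, 25.0, 58, 3.11e6),
]


def fom_from_row(tivf: float, c_out_per_million: float, rate: float, effort: str) -> float:
    """TIVF * R / effective price, with the price converted to USD per raw-token."""
    price = c_out_per_million / 1e6 * (1.0 + LAMBDA * (PSI[effort] - 1.0))
    return tivf * rate / price


recomputed = {}
for model, effort, tivf, c_out, rate, reported in ROWS:
    value = fom_from_row(tivf, c_out, rate, effort)
    recomputed[(model, effort)] = value
    print(f"{model:<18} {effort:<4} {value:.3g} (reported {reported:.3g})")

ratio = recomputed[("DeepSeek V4 Flash", "high")] / recomputed[("Claude Opus 4.8", "max")]
print(f"DeepSeek [high] / Opus 4.8 [max] = {ratio:.0f}x")
```

Under those assumptions the recomputed values agree with the table to within about 1%, with the small differences attributable to the rounded TIVF inputs.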

Figure 1. FOM vs TIVF scatter (June 2026 snapshot).

Figure 2. FOM ranked by model and effort level.

Figure 1 plots FOM against TIVF with the point color representing provider and the marker shape representing thinking effort (two-column legend). Figure 2 ranks every row by FOM on a log scale. The bottom line from Figure 1 seems to be: if you want the most intelligence power per dollar with thinking enabled, use DeepSeek V4 Flash with high thinking (the non-thinking [none] row scores higher still on FOM, at lower TIVF). Opus 4.8 [max] clearly scores higher on TIVF ($\approx 4.8$ vs $\approx 3.8$ for DeepSeek [high]) but buys roughly 250$\times$ less intelligence power per dollar. Several patterns fall out immediately:

Broad-spectrum TIVF separates frontier from budget models. Among the top-effort rows, Opus 4.8 [max] ($\approx 4.8$) leads GPT-5.5 [max] ($\approx 4.5$), Gemini 3 Pro [high] ($\approx 4.3$), and DeepSeek V4 Flash [max] ($\approx 3.9$).
DeepSeek wins FOM because price is low and speed is high. At \$0.28/M output and $\sim$95--106 token/s, DeepSeek [high] buys roughly 250$\times$ more intelligence power per dollar than Opus 4.8 at max effort; Kimi K2.6 and Qwen3 Max trail.
The Gemini Flash line is the best-value major-lab option. At \$3/M output and 300--450 token/s, Gemini 3.1 Flash (TIVF $\approx 4.1$, FOM $\approx 3.7\times 10^8$) slots in just under the DeepSeek rows and edges out the fast/cheap Groq Llama, with Gemini 3 Flash ($\approx 3.8$, $\approx 2.3\times 10^8$) and the older Gemini 2.5 Flash ($\approx 3.3$, $\approx 1.8\times 10^8$) trailing the lineage. Each newer generation buys both higher TIVF and higher FOM, and all three beat every frontier row on FOM by one to two orders of magnitude while carrying near-frontier TIVF.
Grok 4 sits mid-pack on both axes. TIVF $\approx 3.3$ (near Gemini 2.5 Flash) but frontier pricing (\$15/M) wins neither leaderboard.
When TIVF, not FOM, is the constraint, pay up. On the hard tail Opus 4.8 [max] leads DeepSeek V4 Flash [max] on SWE-Pro (70% vs 54%) and HLE w/ tools (58.5% vs 45.1%) (DataCamp, evals.report).
Effort reshuffles both axes. DeepSeek [none] is cheapest and has the highest FOM ($\approx 1.1\times 10^9$) but loses $\sim$0.5 TIVF versus [high] ($\sim$0.7 versus [max]); GPT-5.5 [max] buys $\sim$0.3 TIVF at a steep reasoning-billing premium.
Speed is now rewarded because $R$ sits in the FOM numerator. FOM prices intelligence power (ref-token/s) per dollar via eq. \ref{eq:fom}, so for a fixed TIVF and price a faster model delivers more intelligence per second and scores higher; the old speed paradox (where a slower model looked better) is resolved. Opus 4.8 [fast] ($R=145$) scores FOM $1.38\times 10^7$ vs [high] ($R=58$) at $6.71\times 10^6$, and Groq's fast/cheap Llama vaults to $3.58\times 10^8$ on low TIVF ($\approx 1.2$) almost entirely on speed and price.

FOM and TIVF answer different questions, so deciding which to use depends on your goal:

Your goal	Pick
Fixed budget: maximize total smart work over an hour or a day	DeepSeek V4 Flash high
Fixed budget, need higher TIVF than DeepSeek	Kimi K2.6 high
Fixed budget, easy tasks, cost and speed only	DeepSeek none
Hard agentic, coding, or research cliff	Opus 4.8 max (or GPT-5.5 xhigh)
Best single token, price irrelevant	Opus 4.8 max (highest TIVF in Figure 1)

Case Study Summary

With these ballpark numbers, the split becomes:

Most intelligence power per dollar with thinking enabled: DeepSeek V4 Flash at high thinking effort. For a fixed API budget, this row buys the most intelligence power per dollar among the thinking rows. Kimi K2.6 (high) is the nearest runner-up if you need higher SWE-Pro/HLE scores without Opus pricing. Use DeepSeek max when the extra SWE/HLE points on the capability cliff matter; use none, which has the highest raw FOM, only when you accept lower TIVF for maximum thrift.

Most intelligence per token is Claude Opus 4.8 at max effort (then GPT-5.5 xhigh), accepting more than two orders of magnitude lower FOM.

Again, the key is not to read high FOM as "smartest model": DeepSeek [high] is the value champion, not the capability champion. Confuse the two and you pick the wrong model for your use case.

My Thoughts

I'm in no way knowledgeable enough to state whether this is the correct FOM or whether it even makes sense to the ML/AI community. I was just trying to think through figures of merit because of my familiarity with them in materials physics. This is just a "back of the envelope" type of analysis and proposal, not a standard.

There are a bunch of places where I think this probably falls apart or at least needs heavy caveating, so let me just list what may be the issues:

TIVF is only as good as the scores $S_k$ I feed it. I am leaning on vendor-reported numbers, evals that were not always run at matched effort, and benchmarks that may already be contaminated. So, as the old adage goes, garbage in, garbage out.
My equal weights are just a best guess. I set $w_k=1/6$ in eq. \ref{eq:tivf} because it's easy and I have no other ideas for what it should be. In reality this needs to be tuned to the type of workload.
The effort knob is a guess. $\psi(e)$ in eq. \ref{eq:effort_cost} is my guess for billed tokens, not a real count of reasoning tokens per request. Even the throughput $R$, which I did account for per effort row, will still shift with prompt length, caching, and region.
Throughput and price are not fixed numbers. Reasoning modes, caching, batch APIs, and self-hosting can move $\tilde{c}$ and $R$ a lot, so I'm looking at one point in time.⁴
No accounting for verbosity. A model that scores well while spitting out twice as many tokens looks fine here, even if it actually feels worse to use.
My framing is output-biased. All the input-side smarts get under-sold because I plugged an output-only $\tilde{c}$ into eq. \ref{eq:fom}.
Not exhaustive. I skipped Mistral, Cohere, etc. and I am using Artificial Analysis estimates where the provider pages did not give me matched-effort numbers.

I'd be interested to see how well I did compared to legit AI analysts. Assuming this is reasonable and if I wanted to take this past a back-of-the-envelope toy, what would they suggest? My effort for this is to make the claim as narrow as possible:

When comparing language models, treat tokens as heterogeneous and report intelligence power per dollar alongside raw scores.

Footnotes

Reference model: GPT-4o mini anchors every axis in eq. \ref{eq:tivf_domain} with $(S_\mathrm{MMLU},S_\mathrm{GPQA},S_\mathrm{SWE},S_\mathrm{SWE-Pro},S_\mathrm{HLE},S_\mathrm{TB})=(72,50,40,12,16,10)$ and $\mathrm{TIVF}\equiv 1$ by definition (eq. \ref{eq:tivf_def}), chosen as a stable cost-quality floor, not the most-used production model. ↩
I write the TIVF unit as ref-token/raw-token: each physical (raw) token a model emits carries TIVF reference-equivalent (ref) tokens of intelligence. Keeping the singular ("token") reflects that TIVF is a scalar field over models, one value each, like "one meter." Splitting raw-token from ref-token is what makes the downstream power (ref-token/s) and FOM (ref-token·raw-token/($·s)) units come out honest. ↩
Related work and how this differs. The closest cousins are intelligence index over blended price (Artificial Analysis, WhatLLM.org, Analytical Insider), which ignores latency; tokens per dollar (FriendliAI on DeepSeek V4), which assumes token homogeneity; and economics' quality-adjusted price indices, where here "quality" is an explicit $\mathrm{TIVF}$ tied to public evals rather than a hedonic regression. What I add: (1) a named token rescaling (TIVF) with a broad-spectrum formula (eqs. \ref{eq:tivf_domain}, \ref{eq:tivf}); (2) effort-aware billing for hidden reasoning tokens (eq. \ref{eq:effort_cost}) so comparisons happen at matched thinking levels; and (3) a time factor via intelligence action (eq. \ref{eq:intelligence_action}) yielding FOM (eq. \ref{eq:fom}), separating token quality, the power $\mathrm{TIVF}\cdot R$, and power per dollar. ↩
Prices and throughputs are a snapshot, not fixed; the same headline price hides very different real cost depending on reasoning modes, caching, batch APIs, and self-hosting (Finout on Claude Opus 4.7 pricing). ↩
Output pricing from official provider pages (OpenAI, Anthropic, Google, DeepSeek, xAI, Moonshot, Alibaba Model Studio, Groq). Throughput $R$ is output token/s from Artificial Analysis (June 2026), matched to each row's effort; Moonshot, xAI, and Alibaba use AA estimates. Llama 3.3 70B uses GroqCloud (Groq the host, not xAI Grok). Reasoning models bill hidden tokens as output, and AA's $R$ already counts them, so the price multiplier $\psi(e)$ (eq. \ref{eq:effort_cost}) and $R$ partially overlap; I keep $\psi$ purely on price so reasoning load is billed once, an approximation rather than exact token accounting. ↩↩

References

[1] Y. Wang, X. Ma, et al., MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, arXiv Preprint arXiv:2406.01574. (2024). https://doi.org/10.48550/arXiv.2406.01574.

[2] D. Rein, et al., GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv Preprint arXiv:2311.12022. (2024). https://doi.org/10.48550/arXiv.2311.12022.

[3] C.E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, K.R. Narasimhan, SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, in: International Conference on Learning Representations, 2024. https://doi.org/10.48550/arXiv.2310.06770.


Sunday, June 7, 2026

Drude Electrons

Solid State Physics

Most of us are familiar with the trope about physicists and their spherical cows. You'll certainly encounter this style of thinking when you take a solid-state physics course. But the reality is that many of these toy models are actually not bad approximations, and they worked pretty well to explain differences between classes of materials like metals and insulators. I was organizing my bookshelf, cracked open [1], and was refreshed on one of my favorite such models: the Drude theory of metals.

The discovery of the electron by J.J. Thomson, who described corpuscles of charge, got many at the turn of the 20th century thinking about how these charges moved inside a material. Paul Drude started to think about this problem and treated conduction in the framework of Boltzmann's kinetic theory of gases. The main simplification of the model is that electrons have some fundamental scattering interval (relaxation time or, equivalently, mean free path) represented by a collision rate $1/\tau$. Furthermore, the scattering event randomizes the electron's momentum such that it loses all of its drift momentum after the collision. Finally, while an electron is moving freely (i.e., between collisions) it moves under the Lorentz force from the applied electric and magnetic fields¹.

The primary equation that the Drude theory arrives at is the momentum balance equation: $$ \begin{equation} \frac{d\langle \mathbf{p} \rangle}{dt} = -\frac{\langle \mathbf{p} \rangle}{\tau} + \mathbf{F} \label{eq:momentum_balance} \end{equation} $$

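where $\langle \mathbf{p} \rangle$ is the average momentum, $\tau$ is the relaxation time, and $\mathbf{F}$ is the force. It follows from averaging the motion of a single electron over the random collisions: between collisions the momentum changes under $\mathbf{F}$, and each collision, occurring at rate $1/\tau$, resets the drift momentum to zero.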
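For readers who want the missing step, here is the standard textbook argument as I understand it (a sketch, not necessarily how [1] presents it). Over a short interval $dt$ an electron scatters with probability $dt/\tau$ and ends with zero drift momentum; otherwise it gains $\mathbf{F}\,dt$:

$$ \begin{aligned} \langle \mathbf{p}(t+dt) \rangle &= \left(1-\frac{dt}{\tau}\right)\Big(\langle \mathbf{p}(t) \rangle + \mathbf{F}\,dt\Big) + \frac{dt}{\tau}\cdot\mathbf{0} \\ &= \langle \mathbf{p}(t) \rangle - \frac{dt}{\tau}\langle \mathbf{p}(t) \rangle + \mathbf{F}\,dt + \mathcal{O}(dt^2). \end{aligned} $$

Dividing by $dt$ and letting $dt\to 0$ recovers the momentum balance equation. In steady state $\langle \mathbf{p} \rangle = \mathbf{F}\tau$, so for $\mathbf{F}=-e\mathbf{E}$ the current density $\mathbf{j} = -ne\langle \mathbf{p} \rangle/m$ gives the familiar Drude conductivity $\sigma = ne^2\tau/m$.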

The Drude model then approaches the problem by considering the behavior of electrons in an applied electric and magnetic field to determine the electric current density and then through relational observations of Ohm's law the conductivity of a material can be determined. A similar derivation is done for electrons in an electric and magnetic field to yield the Hall coefficient². For complete detail on the actual equations derived from $\eqref{eq:momentum_balance}$ take a look at [1].

The biggest win for this simple Drude model, where electrons are treated as a kinetic gas and then solved for in an applied electric and magnetic field, is that for idealized metals it worked pretty well as an analytical understanding of electrons in metals. It clearly falters because it's not a quantum theory and thus does not account for Fermi-Dirac statistics and electron correlations. But it was very easy to understand at the time as Maxwell's equations and Boltzmann's gas kinetic theory were well established frameworks.

Drude applied the theory to understand thermal conductivity of metals by assuming electron mobility was the primary carrier³. The application was straightforward because Drude just used the same kinetic theory to arrive at a thermal conductivity given as:

$$ \begin{equation} \kappa = \frac{4}{\pi}\frac{n\tau k_B^2 T}{m} \label{eq:thermal_conductivity} \end{equation} $$

where $n$ is the number of electrons per unit volume, $\tau$ is the relaxation time (i.e., scattering time), $k_B$ is the Boltzmann constant, $T$ is the temperature, and $m$ is the mass of the electron. This follows from the kinetic-theory form $\kappa = \frac{1}{3} c_v \bar{v} \ell$ with $c_v = \frac{3}{2} n k_B$, mean free path $\ell = \bar{v}\tau$, and Maxwell-Boltzmann average speed $\bar{v} = \sqrt{8 k_B T/(\pi m)}$. Combined with the Drude electrical conductivity, it yields a Lorenz number in rough agreement with experiment for many metals at room temperature.
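To put a number on "rough agreement", the sketch below evaluates the Lorenz number $L = \kappa/(\sigma T)$ implied by eq. \ref{eq:thermal_conductivity}. It assumes the standard Drude conductivity $\sigma = ne^2\tau/m$, which the post does not write out, so that $n$, $\tau$, and $m$ cancel. The Sommerfeld value $(\pi^2/3)(k_B/e)^2$ is included only as the usual free-electron benchmark.

```python
import math

K_B = 1.380649e-23          # Boltzmann constant, J/K (exact SI value)
E_CHARGE = 1.602176634e-19  # elementary charge, C (exact SI value)


def drude_lorenz() -> float:
    """L = kappa / (sigma T) with kappa = (4/pi) n tau k_B^2 T / m and sigma = n e^2 tau / m."""
    return (4.0 / math.pi) * (K_B / E_CHARGE) ** 2


def sommerfeld_lorenz() -> float:
    """Free-electron (Fermi-Dirac) benchmark: (pi^2 / 3) * (k_B / e)^2."""
    return (math.pi ** 2 / 3.0) * (K_B / E_CHARGE) ** 2


l_drude = drude_lorenz()
l_sommerfeld = sommerfeld_lorenz()
print(f"Drude      L = {l_drude:.3e} W Ohm / K^2")
print(f"Sommerfeld L = {l_sommerfeld:.3e} W Ohm / K^2")
print(f"ratio        = {l_drude / l_sommerfeld:.2f}")
```

With this particular prefactor the classical estimate comes out at roughly 40% of the Sommerfeld value, so the agreement is at the right-order-of-magnitude level rather than quantitative.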

I'm not going to go through the numbers for different metals; again, you can see all that in [2] or [1]. But the results from the Drude theory were used to compute the Lorenz number, Wiedemann-Franz law, and Seebeck coefficient for different metals. The Lorenz number and Hall coefficient were notable successes; specific heat (why Drude got this wrong) and thermopower were not. Generally speaking, the theory was a good starting point with some successes and failures. Once quantum theory became established, Arnold Sommerfeld applied Fermi-Dirac statistics to the Drude model to arrive at the free electron model of metals, which was more successful (Sommerfeld vs Drude).

Footnotes

In the Drude model, $\mathbf{F} = -e(\mathbf{E} + \mathbf{v} \times \mathbf{B})$ with externally applied $\mathbf{E}$ and $\mathbf{B}$; the fixed ion background provides charge neutrality. Electron-electron interactions enter only through the relaxation time $\tau$ (scattering rate $1/\tau$), not as a separate mean-field force. ↩
The Hall coefficient $R_H$ relates the transverse Hall electric field to the longitudinal current and magnetic field, $E_y = R_H J_x B_z$. In the free-electron Drude model, $R_H = -1/(ne)$ (derivation and sign). ↩
In metals electrons are indeed the primary carriers of heat but for insulators and semiconductors lattice vibrations (i.e. phonons) are the primary carriers of heat. ↩

References

[1] S.H. Simon, The Oxford Solid State Basics, Oxford University Press, 2013.

[2] N.W. Ashcroft, N.D. Mermin, Solid State Physics, Holt, Rinehart and Winston, 1976.
