I was thinking while using Codex this weekend that not every token is as valuable as every other token. Most of the metrics you see relate to token generation rate or cost per token, but that is only a partial picture. When I use a model, I care about how much useful intelligence I get out, how much it costs, and how long it takes to reach the desired utility. Benchmark accuracies, API prices, and tokens per second are almost always documented, yet none of them alone answers the question I actually want answered:
How much useful token intelligence power am I buying per dollar?
Below I am proposing a kind of FOM (Figure of Merit) that compares models on quality, speed, and cost together. I have no idea how useful this will be or how well my definitions for "token intelligence" will hold up. But I think it is a useful thought experiment. Again, the central idea is that not all tokens are equally valuable, which is indeed strongly tied to the model capabilities. In my view, the natural unit for combining value with time and money is what I would call intelligence action, by analogy with physical action and information-theoretic memory.
Why benchmarks, price, and speed are not enough
Three quantities dominate public comparisons:
| Quantity | What it measures | What it misses |
|---|---|---|
| Benchmark score | Task accuracy on fixed evals | Dollar cost and latency |
| Cost per token | API price | Whether the token is any good |
| Throughput (token/s) | Generation speed | Quality of each token |
A frontier model can score a few points higher yet cost an order of magnitude more. A cheap model can be fast yet unreliable on the tasks you care about. Leaderboards such as the Artificial Analysis Intelligence Index already compress many evals into one score, and several sites rank "value" as intelligence divided by price, including WhatLLM.org and Analytical Insider. Those are useful, but they still omit time: two models with the same intelligence-per-dollar can feel very different if one streams at 250 token/s and the other at 45 token/s.[^3]
I want one scalar that rewards quality and thrift together, while keeping the pieces visible, including speed as an explicit tradeoff rather than something the score silently maximizes.
Naming by analogy
My take
I do not know if this is a good take/analogy but this is how I am naming/coining terms. There might already be a term for this but I do not know what it is.
In physics, action scales as energy × time. In information theory, information capacity over time scales as bit × second (I believe this is sometimes also called memory).
Language models output tokens, not joules or bits, although one could think about tokens in these terms :wink:. By the same pattern, intelligence action scales as token × second, and its time-rate is what I call intelligence power (token / second), the quantity the figure of merit below actually prices out. This is just nomenclature I chose; it is not a properly derived claim. The analogy is intentionally shallow: I'm just borrowing the product-of-content-and-time structure without importing the rest. The payoff is readable, meaningful units.
Token Intelligence Value Factor (TIVF)
I'm now going to describe what I call the Token Intelligence Value Factor (TIVF). It is the intelligence content of one generated token, expressed in reference-equivalent tokens. To keep units honest, I'll distinguish two kinds of token throughout: a raw-token is the physical unit you are actually billed for and that throughput $R$ and price $\tilde{c}_\mathrm{out}$ are measured in, while a ref-token is the intelligence-equivalent unit, i.e. one raw-token emitted by the reference model. So one raw-token from model $m$ is worth $\mathrm{TIVF}(m)$ ref-tokens from a fixed reference model $m_\mathrm{ref}$, and TIVF carries the unit ref-token/raw-token.
I anchor the scale by defining the reference model's value as
$$ \begin{equation} \mathrm{TIVF}(m_\mathrm{ref}) \equiv 1\ \text{ref-token}/\text{raw-token}. \label{eq:tivf_def} \end{equation} $$
So TIVF is not dimensionless: it has the unit ref-token/raw-token[^2], but importantly it is a token whose "intelligence" has been rescaled. A model with $\mathrm{TIVF}=2$ produces tokens that count double toward intelligence action relative to the defined reference model.
Capability spectrum
This is a hard one to define because intelligence is not one number. A model can score well on knowledge yet fail on agentic terminal work, or ace GPQA yet stumble on SWE-bench Pro. To approximate a broad spectrum, I assign each model six public scores $S_k(m)$ on a 0--100 scale, where $k$ indexes the score axis and each score is measured at some thinking effort $e$, if the model supports one:
| Axis $k$ | Benchmark | What it probes |
|---|---|---|
| Knowledge | MMLU-Pro [1] | Broad multi-subject knowledge |
| Science | GPQA Diamond [2] | Graduate science Q&A |
| Coding | SWE-bench Verified [3] | Real GitHub issue resolution |
| Hard coding | SWE-bench Pro | Harder multi-language software engineering |
| Reasoning | Humanity's Last Exam w/ tools | Multidisciplinary expert reasoning |
| Agentic | Terminal-Bench 2.x | Shell/terminal agent coordination |
I am trying to mirror the spirit of composite indices such as the Artificial Analysis Intelligence Index while keeping each axis visible instead of collapsing them.
Token intelligence rescaling
For each score axis $k$, I normalize to the reference model $m_\mathrm{ref}$, which will be GPT-4o mini[^1], and apply a mild superlinear exponent $\gamma=1.2$. The thinking is that when more capable models do especially well on the hardest benchmarks, those gains still show up, even if many models already get top scores on the easier ones. What we get is something like
$$ \begin{equation} \mathrm{TIVF}_k(m,e) = \left(\frac{S_k(m,e)}{S_k(m_\mathrm{ref})}\right)^{\gamma}. \label{eq:tivf_domain} \end{equation} $$
$S$ is the score of the model $m$ on the axis $k$. The reference model anchors every axis: $\mathrm{TIVF}_k(m_\mathrm{ref})=1$ for all $k$. The score ratio $S_k(m,e)/S_k(m_\mathrm{ref})$ is itself dimensionless; the ref-token base scale is injected only through the normalization $\mathrm{TIVF}(m_\mathrm{ref})\equiv 1\ \text{ref-token}/\text{raw-token}$ from eq. \ref{eq:tivf_def}.
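To make the per-axis rescaling concrete, here is a minimal Python sketch of eq. \ref{eq:tivf_domain}. The reference scores are the GPT-4o mini values from the reference-model footnote, and the two Opus 4.8 [max] scores come from the case-study table further down. The names `REF_SCORES` and `tivf_axis` are mine, purely for illustration.

```python
GAMMA = 1.2  # mild superlinear exponent from the text

# GPT-4o mini reference scores (0-100 scale), per the reference-model footnote.
REF_SCORES = {
    "MMLU-Pro": 72.0,
    "GPQA": 50.0,
    "SWE-Verified": 40.0,
    "SWE-Pro": 12.0,
    "HLE": 16.0,
    "Terminal-Bench": 10.0,
}


def tivf_axis(score: float, ref_score: float, gamma: float = GAMMA) -> float:
    """Per-axis rescaling: (S_k(m, e) / S_k(m_ref)) ** gamma."""
    if score < 0 or ref_score <= 0:
        raise ValueError("score must be non-negative and ref_score positive")
    return (score / ref_score) ** gamma


# Two axes of Claude Opus 4.8 [max] as listed in the case-study table.
opus_max = {"SWE-Pro": 70.0, "HLE": 58.5}
per_axis = {axis: tivf_axis(s, REF_SCORES[axis]) for axis, s in opus_max.items()}

for axis, value in per_axis.items():
    print(f"{axis}: TIVF_k = {value:.2f}")
```

Because the reference scores on the hard axes are low (12 and 16), those axes produce the largest per-axis ratios, which is the effect the exponent is meant to preserve.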
I chose GPT-4o mini as the reference model because it seems to be among the most widely used models across stacks, although it might be that Anthropic models lead enterprise API usage.
To prevent a single strong axis from dominating or compensating for the weaker ones, we can take a weighted geometric mean of the axes:
$$ \begin{equation} \boxed{ \begin{aligned} \mathrm{TIVF}(m,e) &= \kappa(m)\cdot \exp\left(\sum_k w_k \ln \mathrm{TIVF}_k(m,e)\right) \\ &= \kappa(m)\cdot\prod_k \mathrm{TIVF}_k(m,e)^{\,w_k} \quad\big[\text{ref-token}/\text{raw-token}\big] \end{aligned} } \label{eq:tivf} \end{equation} $$
Then we use equal weights (i.e., $w_k=1/6$). One thing we need to account for is that a larger context window does not automatically mean "smarter" answers; rather, it widens the class of problems the model can attempt. The context modifier $\kappa(m)$ from eq. \ref{eq:context_modifier} tries to capture which tasks are "attemptable" by a model.
By construction $\mathrm{TIVF}(m_\mathrm{ref},e)\equiv 1$ ref-token/raw-token, because every ratio $S_k(m_\mathrm{ref})/S_k(m_\mathrm{ref})$ equals 1 and $\kappa(m_\mathrm{ref})=1$. A model with $\mathrm{TIVF}=4$ produces tokens that count quadruple toward intelligence action relative to the reference across the full spectrum, not on a single leaderboard. Note that the geometric (not arithmetic) mean was deliberate on my part. A weak axis drags the product down rather than being averaged away by a strong one, so a hard-capability cliff (e.g. high knowledge but failing coding) is penalized rather than hidden. Any scalar can still mask a single-axis failure, which is why the per-axis $S_k$ are reported alongside TIVF.
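The effect of choosing a geometric over an arithmetic mean can be sketched in a few lines of Python. The per-axis values below are hypothetical (not taken from any model in this post) and were picked only to show a single weak axis; `tivf` is my own helper name, and $\kappa$ is left at 1 here.

```python
import math
from typing import Dict, Optional


def tivf(per_axis: Dict[str, float],
         weights: Optional[Dict[str, float]] = None,
         kappa: float = 1.0) -> float:
    """kappa * prod_k TIVF_k ** w_k, evaluated in log space."""
    if weights is None:
        # Equal weights, w_k = 1/6 when six axes are supplied.
        weights = {k: 1.0 / len(per_axis) for k in per_axis}
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    # math.log raises ValueError for a zero per-axis value (a zero score).
    log_mean = sum(weights[k] * math.log(per_axis[k]) for k in per_axis)
    return kappa * math.exp(log_mean)


# Hypothetical model: strong on five axes, with a cliff on coding.
lopsided = {"knowledge": 6.0, "science": 6.0, "coding": 0.5,
            "hard_coding": 6.0, "reasoning": 6.0, "agentic": 6.0}

geometric = tivf(lopsided)
arithmetic = sum(lopsided.values()) / len(lopsided)

print(f"geometric mean:  {geometric:.2f}")
print(f"arithmetic mean: {arithmetic:.2f}")
```

The geometric value lands noticeably below the arithmetic one, so the coding cliff stays visible in the scalar. Passing a custom `weights` dictionary is one way to express a task-specific workload.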
$$ \begin{equation} \kappa(m) = 1 + \beta\cdot\ln\left(\frac{L_m}{L_\mathrm{ref}}\right), \qquad \beta = 0.087. \label{eq:context_modifier} \end{equation} $$
With $m_\mathrm{ref}$ corresponding to GPT-4o mini ($L_\mathrm{ref}=128\,\mathrm{k}$ tokens), a 1M-token model picks up roughly an 18% multiplicative bonus ($\beta=0.087$). The coefficient is small on purpose so context is an enabler, not a substitute for reasoning scores. Models with context windows below $L_\mathrm{ref}$ get $\kappa<1$, an intentional penalty reflecting the narrower class of tasks they can attempt. In truth, this is probably the framework's least defensible knob, since context length is a capacity constraint, not intelligence. When a task fits comfortably inside every model's window, $\kappa$ rewards headroom that yields no real utility. This means we should treat $\kappa$ as a task-attemptability gate, not an intelligence premium.
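As a quick check of eq. \ref{eq:context_modifier}, the sketch below evaluates $\kappa$ for three window sizes. I am assuming "128k" and "1M" mean 128,000 and 1,000,000 tokens; the 32k window is a hypothetical example of a model below the reference, and `context_modifier` is my own function name.

```python
import math

BETA = 0.087
L_REF = 128_000  # GPT-4o mini context window, assumed to be 128,000 tokens


def context_modifier(context_tokens: int,
                     l_ref: int = L_REF,
                     beta: float = BETA) -> float:
    """kappa(m) = 1 + beta * ln(L_m / L_ref)."""
    if context_tokens <= 0:
        raise ValueError("context window must be positive")
    return 1.0 + beta * math.log(context_tokens / l_ref)


for window in (32_000, 128_000, 1_000_000):
    print(f"{window:>9,d} tokens -> kappa = {context_modifier(window):.3f}")
```

The 1M window comes out near 1.18, matching the roughly 18% bonus quoted above, while the hypothetical 32k window drops to about 0.88.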
Task-specific workloads can reuse eq. \ref{eq:tivf} with custom weights $w_k$ (normalized to sum to 1, as the equal weights do), e.g. overweighting Hard coding for agentic coding, without changing the FOM details below. One other aspect is how "intelligence action" relates to "intelligence power". I keep both: eq. \ref{eq:intelligence_action} defines the action as a cognitive footprint (ref-token$\cdot$s per raw-token) that keeps TIVF visible as a token rescaling rather than collapsing into raw throughput, while its time-rate $\mathrm{TIVF}\cdot R$ is the intelligence power (ref-token/s) that the figure of merit below actually prices out.
Including Reasoning effort
Most models expose some form of effort levels (OpenAI reasoning_effort, Anthropic extended/adaptive thinking, DeepSeek thinking modes) that change three things at once: the spectrum scores $S_k(m,e)$, hidden reasoning tokens billed as output, and wall-clock throughput $R$ (OpenAI GPT-5.5, Anthropic Opus 4.8, DeepSeek V4 pricing, evals.report).
To account for this, I tag each row with an effort level $e\in\{\mathrm{none},\mathrm{high},\mathrm{max},\mathrm{fast}\}$ and use effort-specific benchmark scores when available. Hidden reasoning is modeled as a kind of effective output-price multiplier:
$$ \begin{equation} \tilde{c}_\mathrm{out}(m,e) = c_\mathrm{out}(m)\cdot \Big[1 + \lambda\cdot\big(\psi(e)-1\big)\Big], \qquad \lambda = 0.65, \label{eq:effort_cost} \end{equation} $$
where $\psi(e)$ is a billed-token multiplier ($\psi=1$ for non-thinking, $\sim 2$ at default thinking, $\sim 5$ at max/xhigh effort). Throughput $R$ is taken directly from provider API measurements at the stated effort level (Artificial Analysis, June 2026 snapshot)[^5], not scaled again by a latency factor. Note $R$ here is a response output rate, which for reasoning models includes hidden reasoning tokens, so $R$ already absorbs the thinking slowdown; $\psi$ acts only on the billed price, keeping rate ($R$) and price ($\psi$) on separate factors so reasoning is not billed twice (this is an approximation, see limitations). Fast mode rows use Anthropic fast-tier pricing with throughput scaled from the standard Opus row ($\times 2.5$).
The idea is this is a proxy for compute spent per visible token, not a literal token accounting. It prevents comparing a cheap non-thinking call against a frontier model evaluated at max effort.
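Here is a small Python sketch of eq. \ref{eq:effort_cost}. The mapping of effort labels to $\psi$ is my reading of the text (1 for none, 2 for high, 5 for max) and should be treated as an assumption; the fast tier is left out because the text does not state its $\psi$. The \$0.28/M price is the DeepSeek V4 Flash output price from the case-study table.

```python
LAMBDA = 0.65

# Assumed mapping from effort label to billed-token multiplier psi(e).
PSI = {"none": 1.0, "high": 2.0, "max": 5.0}


def effective_output_price(c_out: float, effort: str,
                           lam: float = LAMBDA) -> float:
    """c_out * [1 + lambda * (psi(e) - 1)], in the same units as c_out."""
    psi = PSI[effort]
    return c_out * (1.0 + lam * (psi - 1.0))


deepseek_price = 0.28  # USD per million output tokens
prices = {e: effective_output_price(deepseek_price, e) for e in PSI}

for effort, price in prices.items():
    print(f"{effort:>4}: ${price:.3f} / M effective")
```

With these assumptions the effective price multiplier is 1.0 at none, 1.65 at high, and 3.6 at max, so max effort is modeled as a bit under four times as expensive per visible token.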
Figure of merit: intelligence power per dollar
Intelligence action is a kind of cognitive footprint of a generation: the TIVF a token carries, held over the wall-clock time $t$ it takes to produce,
$$ \begin{equation} \mathcal{A}_I = \mathrm{TIVF}\cdot t \quad \Big[\tfrac{\text{ref-token}\cdot\mathrm{s}}{\text{raw-token}}\Big]. \label{eq:intelligence_action} \end{equation} $$
In the framing of the principle of least action, intelligence action becomes a kind of latency weight: for a fixed TIVF you minimize $\mathcal{A}_I$ by driving the time $t$ down, so a model that delivers the same intelligence faster leaves a smaller sweep. What you actually use per second is the time-rate of that intelligence, a kind of intelligence power,
$$ \begin{equation} P_I = \mathrm{TIVF}\cdot R \quad \big[\text{ref-token}/\mathrm{s}\big], \label{eq:fom_deriv} \end{equation} $$
where $R$ is the measured output raw-token/s at effort $e$. Dividing by the effective cost rate $\tilde{c}_\mathrm{out}$ (USD per raw-token) gives the FOM that has units of intelligence power per dollar:
$$ \begin{equation} \boxed{ \mathrm{FOM}(m,e) = \frac{\mathrm{TIVF}(m,e)\cdot R(m,e)}{\tilde{c}_\mathrm{out}(m,e)} \quad\Big[\tfrac{\text{ref-token}\cdot\text{raw-token}}{\$\cdot\mathrm{s}}\Big]. } \label{eq:fom} \end{equation} $$
The numerator $\mathrm{TIVF}\cdot R$ is intelligence power (ref-token/s), but note that the denominator is the unit price, not dollars spent: dividing instead by the spend rate $\tilde{c}_\mathrm{out}\cdot R$ (USD/s) cancels $R$ and leaves the speed-blind value efficiency $\mathrm{TIVF}/\tilde{c}_\mathrm{out}$ (ref-token/\$). So FOM is by design value-per-dollar $\times$ throughput: $R$ stays up top to reward speed, and the raw-token in ref-token$\cdot$raw-token/(\$$\cdot$s) is the receipt. The price $\tilde{c}_\mathrm{out}$ reduces to raw output pricing when $\psi=1$. Input/cache pricing matters in RAG and agent workloads, and a blended $\tilde{c}$ is easy to substitute into eq. \ref{eq:fom} if your workload is input-dominated. As a reminder, $m$ is the model and $e$ the effort level.
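To show how the pieces combine, the sketch below evaluates eq. \ref{eq:fom} for the three DeepSeek V4 Flash rows, using the TIVF, $R$, and price values from the case-study tables. The $\psi$ values per effort (1, 2, 5) are my assumption from the effort section, and the function names are illustrative. Since the table TIVF values are rounded to two decimals, the results should only be expected to land close to the tabulated FOM.

```python
LAMBDA = 0.65


def effective_price_per_token(price_per_million: float, psi: float,
                              lam: float = LAMBDA) -> float:
    """Effective output price in USD per raw-token."""
    return price_per_million * 1e-6 * (1.0 + lam * (psi - 1.0))


def fom(tivf: float, rate: float, price_per_million: float,
        psi: float = 1.0) -> float:
    """FOM = TIVF * R / c_tilde_out, in ref-token*raw-token / (USD*s)."""
    return tivf * rate / effective_price_per_token(price_per_million, psi)


def value_efficiency(tivf: float, price_per_million: float,
                     psi: float = 1.0) -> float:
    """Speed-blind TIVF / c_tilde_out, in ref-token / USD."""
    return tivf / effective_price_per_token(price_per_million, psi)


# (effort, TIVF, R in token/s, USD per M output tokens, assumed psi)
deepseek_rows = [
    ("none", 3.23, 94, 0.28, 1.0),
    ("high", 3.76, 95, 0.28, 2.0),
    ("max", 3.94, 106, 0.28, 5.0),
]

results = {}
for effort, t, r, price, psi in deepseek_rows:
    results[effort] = fom(t, r, price, psi)
    print(f"DeepSeek V4 Flash [{effort}]: FOM = {results[effort]:.3g}, "
          f"value efficiency = {value_efficiency(t, price, psi):.3g} ref-token/USD")
```

Printing the value efficiency next to FOM keeps the speed-blind number visible, so you can see how much of a row's ranking comes from throughput rather than from price and TIVF.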
Reading the number
Larger FOM means more intelligence value per dollar, weighted by throughput for the stated effort! It is a value metric, not a capability one, so compare at matched effort: a budget model can win on FOM while losing on TIVF. Price dominates the spread ($c_\mathrm{out}$ spans $\sim$100$\times$ vs $R$ $\sim$10$\times$ and TIVF $\sim$5$\times$), so FOM is, to first order, a tokens-per-dollar ranking that TIVF only tilts.
Case study: Ballpark comparison (June 2026)
Note
Snapshot date: June 2026. Pricing and throughput numbers move quickly; treat the tables as ballpark illustrations, not a live leaderboard. Also, LLMs were used heavily here to generate the tables and plots. Some aspects of the pattern analysis were also done with LLMs.
The first table lists cross-vendor inputs at each row's stated effort; the second lists the Claude Opus lineage. A third table shows derived TIVF and FOM from eqs. \ref{eq:tivf} and \ref{eq:fom}. $S$ is the arithmetic mean of the six axis scores (display only). Output pricing $c_\mathrm{out}$ is from official provider pages (June 2026); throughput $R$ is output token/s from Artificial Analysis provider API measurements at the stated effort[^5]. Benchmark scores are from vendor cards, evals.report, BenchLM, MorphLLM, and Anthropic system cards. Coverage spans OpenAI, Anthropic, Google, DeepSeek, xAI (Grok), Moonshot (Kimi), and Alibaba (Qwen), plus one open-weight row on GroqCloud; Groq (inference host) is not Grok (xAI's model).
Input tables and derived TIVF / FOM

Cross-vendor

| Model | Effort | $S$ | SWE-Pro | HLE tools | $c_\mathrm{out}$ | $R$ |
|---|---|---|---|---|---|---|
| DeepSeek V4 Flash | max | 69.3 | 54.0% | 45.1% | \$0.28 / M | 106 |
| DeepSeek V4 Flash | high | 67.2 | 52.6% | 40.3% | \$0.28 / M | 95 |
| DeepSeek V4 Flash | none | 61.1 | 49.1% | 28.0% | \$0.28 / M | 94 |
| GPT-4o mini (ref.) | high | 33.3 | 12.0% | 16.0% | \$0.60 / M | 66 |
| Gemini 2.5 Pro | high | 71.4 | 54.2% | 51.4% | \$10.0 / M | 141 |
| Gemini 3 Pro | high | 74.1 | 55.0% | 52.0% | \$12.0 / M | 130 |
| Gemini 2.5 Flash | high | 60.5 | 45.0% | 38.0% | \$2.50 / M | 225 |
| Gemini 3 Flash | high | 67.7 | 50.0% | 43.5% | \$3.00 / M | 300 |
| Gemini 3.1 Flash | high | 69.0 | 52.0% | 45.0% | \$3.00 / M | 450 |
| Kimi K2.6 | high | 72.8 | 58.6% | 54.0% | \$4.0 / M | 88 |
| Qwen3 Max | high | 69.2 | 50.0% | 49.5% | \$6.0 / M | 72 |
| Grok 4 | high | 65.5 | 48.0% | 42.0% | \$15.0 / M | 78 |
| GPT-5.5 | max | 76.0 | 58.6% | 52.2% | \$30.0 / M | 53 |
| GPT-5.5 | high | 72.1 | 55.0% | 50.0% | \$30.0 / M | 56 |

Claude Opus lineage

| Model | Effort | $S$ | SWE-Pro | HLE tools | $c_\mathrm{out}$ | $R$ |
|---|---|---|---|---|---|---|
| Claude Opus 4.5 | high | 67.8 | 45.9% | 43.4% | \$25.0 / M | 42 |
| Claude Opus 4.6 | high | 72.5 | 53.4% | 53.1% | \$25.0 / M | 45 |
| Claude Opus 4.7 | high | 77.0 | 64.3% | 54.7% | \$25.0 / M | 45 |
| Claude Opus 4.7 | max | 77.8 | 65.5% | 55.5% | \$25.0 / M | 55 |
| Claude Opus 4.8 | high | 79.2 | 69.2% | 57.9% | \$25.0 / M | 58 |
| Claude Opus 4.8 | max | 79.8 | 70.0% | 58.5% | \$25.0 / M | 58 |
| Claude Opus 4.8 | fast | 79.2 | 69.2% | 57.9% | \$50.0 / M | 145 |

Derived TIVF and FOM

| Model | Effort | TIVF [ref-token/raw-token] | FOM [ref-tok·raw-tok/(\$·s)] |
|---|---|---|---|
| DeepSeek V4 Flash | none | 3.23 | $1.08\times 10^{9}$ |
| DeepSeek V4 Flash | high | 3.76 | $7.73\times 10^{8}$ |
| DeepSeek V4 Flash | max | 3.94 | $4.15\times 10^{8}$ |
| Gemini 3.1 Flash | high | 4.12 | $3.75\times 10^{8}$ |
| Llama 3.3 70B (Groq) | high | 1.18 | $3.58\times 10^{8}$ |
| Gemini 3 Flash | high | 3.81 | $2.31\times 10^{8}$ |
| Gemini 2.5 Flash | high | 3.28 | $1.79\times 10^{8}$ |
| GPT-4o mini | high | 1.00 | $6.67\times 10^{7}$ |
| Kimi K2.6 | high | 3.85 | $5.13\times 10^{7}$ |
| Gemini 2.5 Pro | high | 4.15 | $3.55\times 10^{7}$ |
| Claude Haiku 4.5 | high | 2.59 | $3.36\times 10^{7}$ |
| Gemini 3 Pro | high | 4.33 | $2.84\times 10^{7}$ |
| Qwen3 Max | high | 3.55 | $2.58\times 10^{7}$ |
| Claude Opus 4.8 | fast | 4.77 | $1.38\times 10^{7}$ |
| Grok 4 | high | 3.27 | $1.03\times 10^{7}$ |
| Claude Sonnet 4.6 | high | 3.70 | $8.22\times 10^{6}$ |
| Claude Opus 4.8 | high | 4.77 | $6.71\times 10^{6}$ |
| Claude Opus 4.7 | high | 4.57 | $4.98\times 10^{6}$ |
| GPT-5.5 | high | 4.19 | $4.74\times 10^{6}$ |
| Claude Opus 4.6 | high | 4.21 | $4.59\times 10^{6}$ |
| Claude Opus 4.5 | high | 3.34 | $3.40\times 10^{6}$ |
| Claude Opus 4.8 | max | 4.82 | $3.11\times 10^{6}$ |
| Claude Opus 4.7 | max | 4.64 | $2.83\times 10^{6}$ |
| GPT-5.5 | max | 4.49 | $2.20\times 10^{6}$ |
Figure 1. FOM vs TIVF scatter (June 2026 snapshot).

Figure 2. FOM ranked by model and effort level.
Figure 1 plots FOM against TIVF, with point color representing provider and marker shape representing thinking effort (two-column legend). Figure 2 ranks every row by FOM on a log scale. The bottom line from Figure 1 seems to be: if you want the most intelligence power per dollar while keeping some thinking on, use DeepSeek V4 Flash with high thinking. Opus 4.8 [max] clearly scores higher on TIVF ($\approx 4.8$ vs $\approx 3.8$ for DeepSeek [high]) but buys roughly 250$\times$ less intelligence power per dollar. Several patterns fall out immediately:
- Broad-spectrum TIVF separates frontier from budget models. At each model's highest listed effort, Opus 4.8 ($\approx 4.8$) leads GPT-5.5 ($\approx 4.5$), Gemini 3 Pro ($\approx 4.3$), and DeepSeek V4 Flash ($\approx 3.9$).
- DeepSeek wins FOM because price is low and speed is high. At \$0.28/M output and $\sim$95--106 token/s it buys roughly 250$\times$ more intelligence power per dollar than Opus 4.8 at max effort; Kimi K2.6 and Qwen3 Max trail.
- The Gemini Flash line is the best-value major-lab option. At \$3/M output and 300--450 token/s, Gemini 3.1 Flash (TIVF $\approx 4.1$, FOM $\approx 3.7\times 10^8$) slots in just under the DeepSeek rows and edges out the fast/cheap Groq Llama, with Gemini 3 Flash ($\approx 3.8$, $\approx 2.3\times 10^8$) and the older Gemini 2.5 Flash ($\approx 3.3$, $\approx 1.8\times 10^8$) trailing the lineage. Each newer generation buys both higher TIVF and higher FOM, and all three beat every frontier row on FOM by one to two orders of magnitude while carrying near-frontier TIVF.
- Grok 4 sits mid-pack on both axes. TIVF $\approx 3.3$ (near Gemini 2.5 Flash) but frontier pricing (\$15/M) wins neither leaderboard.
- When TIVF, not FOM, is the constraint, pay up. On the hard tail Opus 4.8 [max] leads SWE-Pro (70% vs 54%) and HLE w/ tools (58.5% vs 45.1%) (DataCamp, evals.report).
- Effort reshuffles both axes. DeepSeek [none] is the cheapest row (FOM $\approx 1.1\times 10^9$) but loses $\sim$0.5 TIVF versus [high] and $\sim$0.7 versus [max]; GPT-5.5 [max] buys $\sim$0.3 TIVF at a steep reasoning-billing premium.
- Speed is now rewarded because $R$ sits in the FOM numerator. FOM prices intelligence power (ref-token/s) per dollar via eq. \ref{eq:fom}, so for a fixed TIVF and price a faster model delivers more intelligence per second and scores higher; the old speed paradox (where a slower model looked better) is resolved. Opus 4.8 [fast] ($R=145$) scores FOM $1.38\times 10^7$ vs [high] ($R=58$) at $6.71\times 10^6$, and Groq's fast/cheap Llama vaults to $3.58\times 10^8$ on low TIVF ($\approx 1.2$) almost entirely on speed and price.
FOM and TIVF answer different questions, so deciding which to use depends on your goal:
| Your goal | Pick |
|---|---|
| Fixed budget: maximize total smart work over an hour or a day | DeepSeek V4 Flash high |
| Fixed budget, need higher TIVF than DeepSeek | Kimi K2.6 high |
| Fixed budget, easy tasks, cost and speed only | DeepSeek none |
| Hard agentic, coding, or research cliff | Opus 4.8 max (or GPT-5.5 xhigh) |
| Best single token, price irrelevant | Opus 4.8 max (highest TIVF in Figure 1) |
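One way to explore this tradeoff yourself, beyond the picks in the table, is to list the rows that are not beaten on both TIVF and FOM by any other row (a Pareto frontier). This is my own illustration rather than part of the method above; the data is copied from the derived TIVF and FOM table, and `pareto_frontier` is a hypothetical helper name.

```python
# (model, effort, TIVF, FOM) copied from the derived table.
ROWS = [
    ("DeepSeek V4 Flash", "none", 3.23, 1.08e9),
    ("DeepSeek V4 Flash", "high", 3.76, 7.73e8),
    ("DeepSeek V4 Flash", "max", 3.94, 4.15e8),
    ("Gemini 3.1 Flash", "high", 4.12, 3.75e8),
    ("Llama 3.3 70B (Groq)", "high", 1.18, 3.58e8),
    ("Gemini 3 Flash", "high", 3.81, 2.31e8),
    ("Gemini 2.5 Flash", "high", 3.28, 1.79e8),
    ("GPT-4o mini", "high", 1.00, 6.67e7),
    ("Kimi K2.6", "high", 3.85, 5.13e7),
    ("Gemini 2.5 Pro", "high", 4.15, 3.55e7),
    ("Claude Haiku 4.5", "high", 2.59, 3.36e7),
    ("Gemini 3 Pro", "high", 4.33, 2.84e7),
    ("Qwen3 Max", "high", 3.55, 2.58e7),
    ("Claude Opus 4.8", "fast", 4.77, 1.38e7),
    ("Grok 4", "high", 3.27, 1.03e7),
    ("Claude Sonnet 4.6", "high", 3.70, 8.22e6),
    ("Claude Opus 4.8", "high", 4.77, 6.71e6),
    ("Claude Opus 4.7", "high", 4.57, 4.98e6),
    ("GPT-5.5", "high", 4.19, 4.74e6),
    ("Claude Opus 4.6", "high", 4.21, 4.59e6),
    ("Claude Opus 4.5", "high", 3.34, 3.40e6),
    ("Claude Opus 4.8", "max", 4.82, 3.11e6),
    ("Claude Opus 4.7", "max", 4.64, 2.83e6),
    ("GPT-5.5", "max", 4.49, 2.20e6),
]


def pareto_frontier(rows):
    """Keep rows whose TIVF strictly exceeds that of every higher-FOM row."""
    frontier = []
    best_tivf = float("-inf")
    # Walk from highest FOM down; ties on FOM are ordered by higher TIVF first.
    for row in sorted(rows, key=lambda r: (-r[3], -r[2])):
        if row[2] > best_tivf:
            frontier.append(row)
            best_tivf = row[2]
    return frontier


frontier = pareto_frontier(ROWS)
for model, effort, tivf_value, fom_value in frontier:
    print(f"{model} [{effort}]: TIVF = {tivf_value:.2f}, FOM = {fom_value:.2e}")
```

Rows off this frontier can still be reasonable picks when a single axis such as SWE-Pro or HLE matters more to you than the composite TIVF, which is why the per-axis scores stay in the input tables.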
Case Study Summary
Under this ballpark comparison, the split becomes:
Most intelligence power per dollar: DeepSeek V4 Flash at high thinking effort. For a fixed API budget, this row buys the most intelligence power per dollar. Kimi K2.6 (high) is the nearest runner-up if you need higher SWE-Pro/HLE scores without Opus pricing. Use DeepSeek max when the extra SWE/HLE points on the capability cliff matter; use none only when you accept lower TIVF for maximum thrift.
Most intelligence per token is Claude Opus 4.8 at max effort (then GPT-5.5 xhigh), accepting more than two orders of magnitude lower FOM.
Again, the key is not to read high FOM as "smartest model": DeepSeek [high] is the value champion, not the capability champion. Confuse the two and you will deploy the wrong model for your use case.
My Thoughts
I'm in no way knowledgeable enough to state whether this is the correct FOM or whether it even makes sense to the ML/AI community. I was just trying to think through figures of merit because of my familiarity with them in materials physics. This is a "back of the envelope" type of analysis and proposal, not a standard.
There are a bunch of places where I think this probably falls apart or at least needs heavy caveating, so let me just list what may be the issues:
- TIVF is only as good as the scores $S_k$ I feed it. I am leaning on vendor-reported numbers, evals that were not always run at matched effort, and benchmarks that may already be contaminated. As the old adage goes, garbage in, garbage out.
- My equal weights are just a best guess. I set $w_k=1/6$ in eq. \ref{eq:tivf} because it's easy and I have no better idea of what they should be. In reality this needs to be tuned to the type of workload.
- The effort knob is a guess. $\psi(e)$ in eq. \ref{eq:effort_cost} is my guess for billed tokens, not a real count of reasoning tokens per request. Even the throughput $R$, which I did account for per effort row, will still shift with prompt length, caching, and region.
- Throughput and price are not fixed numbers. Reasoning modes, caching, batch APIs, and self-hosting can move $\tilde{c}$ and $R$ a lot, so I'm looking at one point in time.[^4]
- No accounting for verbosity. A model that scores well while spitting out twice as many tokens looks fine here, even if it actually feels worse to use.
- My framing is output-biased. All the input-side smarts get under-sold because I plugged an output-only $\tilde{c}$ into eq. \ref{eq:fom}.
- Not exhaustive. I skipped Mistral, Cohere, etc. and I am using Artificial Analysis estimates where the provider pages did not give me matched-effort numbers.
I'd be interested to see how well I did compared to legit AI analysts. Assuming this is reasonable and if I wanted to take this past a back-of-the-envelope toy, what would they suggest? My effort for this is to make the claim as narrow as possible:
When comparing language models, treat tokens as heterogeneous and report intelligence power per dollar alongside raw scores.
Footnotes
[^1]: Reference model: GPT-4o mini anchors every axis in eq. \ref{eq:tivf_domain} with $(S_\mathrm{MMLU},S_\mathrm{GPQA},S_\mathrm{SWE},S_\mathrm{SWE-Pro},S_\mathrm{HLE},S_\mathrm{TB})=(72,50,40,12,16,10)$ and $\mathrm{TIVF}\equiv 1$ by definition (eq. \ref{eq:tivf_def}), chosen as a stable cost-quality floor, not the most-used production model.

[^2]: I write the TIVF unit as ref-token/raw-token: each physical (raw) token a model emits carries TIVF reference-equivalent (ref) tokens of intelligence. Keeping the singular ("token") reflects that TIVF is a scalar field over models, one value each, like "one meter." Splitting raw-token from ref-token is what makes the downstream power (ref-token/s) and FOM (ref-token·raw-token/(\$·s)) units come out honest.

[^4]: Prices and throughputs are a snapshot, not fixed; the same headline price hides very different real cost depending on reasoning modes, caching, batch APIs, and self-hosting (Finout on Claude Opus 4.7 pricing).

[^5]: Output pricing from official provider pages (OpenAI, Anthropic, Google, DeepSeek, xAI, Moonshot, Alibaba Model Studio, Groq). Throughput $R$ is output token/s from Artificial Analysis (June 2026), matched to each row's effort; Moonshot, xAI, and Alibaba use AA estimates. Llama 3.3 70B uses GroqCloud (Groq the host, not xAI Grok). Reasoning models bill hidden tokens as output, and AA's $R$ already counts them, so the price multiplier $\psi(e)$ (eq. \ref{eq:effort_cost}) and $R$ partially overlap; I keep $\psi$ purely on price so reasoning load is billed once, an approximation rather than exact token accounting.
References