What the Numbers Can Bear

What 40k statistics can meaningfully tell us — and what 11th edition’s new variables will do to the answer.

The Editor · Protocol · Methodology editorial

This is an editorial: it makes an argument and lands on a position. But it is built on a methodology spine, and I will keep the two visible — marking where a claim is structural, true by arithmetic, and where it is my read, a judgement you are free to weigh differently. The question underneath it is the one every reader of a win-rate table eventually asks: is this telling me something, or am I being entertained?

The budget

Start with the one fact everything else rests on. Competitive Warhammer 40,000 generates a finite, roughly knowable volume of recorded games. A weekend of tournaments produces some number of results; a month produces more; an edition, more again — but at every horizon the figure is bounded, and it is not large the way people imagine “data” is large. It is thousands of games a week, not millions.

That volume is a budget, and every statistic you read is an act of spending it. The most familiar one, a faction’s win rate, takes the period’s games and divides them across the twenty-nine factions. That is already a real division: in the Archive’s own most recent weekly pull, the busiest faction sat on a few hundred games and the quietest on barely thirty. A win rate built on thirty games is not a measurement so much as a rumour. That is structural, not a complaint; it is what the arithmetic of a small binomial sample says, and the Archive’s Six Bins piece works it through: across a five-round event, one standard deviation of noise around a single player’s “true” rate is something like twenty-two points.
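The twenty-two-point figure can be reproduced from the binomial distribution directly. What follows is a minimal sketch, assuming five independent games with no draws and a true win rate of 50%; the function names are illustrative, not taken from any Archive tooling.

```python
from math import comb, sqrt

def record_distribution(rounds: int, true_rate: float) -> list[float]:
    """Probability of finishing with 0, 1, ..., `rounds` wins."""
    return [
        comb(rounds, wins) * true_rate**wins * (1 - true_rate) ** (rounds - wins)
        for wins in range(rounds + 1)
    ]

def observed_rate_sd(rounds: int, true_rate: float) -> float:
    """Standard deviation of the observed win rate, computed from the distribution."""
    probs = record_distribution(rounds, true_rate)
    mean = sum(p * wins / rounds for wins, p in enumerate(probs))
    variance = sum(p * (wins / rounds - mean) ** 2 for wins, p in enumerate(probs))
    return sqrt(variance)

# Assumed: a perfectly average player (true rate 50%) at a five-round event.
probs = record_distribution(5, 0.5)
for wins, p in enumerate(probs):
    print(f"{wins}-{5 - wins}: {p:.3f}")
print(f"one standard deviation: {100 * observed_rate_sd(5, 0.5):.1f} points")
```

A five-round event has only six possible records, and an exactly average player lands on 2-3 or 3-2 about five times in eight. The remaining outcomes are what put the standard deviation near 22 points.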

So before any question of value: the budget is small, and every number is a slice of it.

Three ways to count

There is no single 40k dataset, and the differences between the ones we have are not quality differences — they are different decisions about how to spend the budget. Three are worth holding side by side, because between them they map the whole trade.

Figure: scatter plot of three datasets (40kstats, Stat Check, and the Infinite Archive) on a trade-off between sample size and how current the picture is. 40kstats sits high on sample size (about 280,000 games) but reaches back across the whole edition. Stat Check sits in the middle (about 15,000 games, a multi-week window). The Infinite Archive sits lowest on sample (about 2,900 games) but covers only the current week.

Caption: Three real datasets, from the Archive’s Week 21 snapshots, on the trade every dataset makes: recency paid for in sample size.

Whole sets — 40kstats, drawing on the TableTop Battles record — take the entire edition. The Archive’s recent snapshot of it carried something near 280,000 games. That is an enormous sample, and the confidence intervals it supports are tight enough to say something even about rarely-played factions. The cost runs the other way: it is an average over the whole edition, blending the meta before a dataslate with the meta after it, slow to register that anything has changed at all. It tells you, very precisely, what the edition was.

Restricted pulls — Stat Check is the type — filter. They take a recent window, set a floor on event size, and report that. The Archive’s snapshot of Stat Check covered about 15,000 games. That is the middle of the trade: recent enough to be about the current meta, filtered for event quality, but a fraction of the whole set’s sample — and every filter choice, which events count and how long the window runs, is an assumption baked quietly into the number.

Weekly pulls — what the Archive does — take one ISO week. Roughly 2,900 games in the Week 21 pull. That is the most current picture available and the smallest; it is built to catch what is moving, and it is hopeless for fine distinctions.

None of the three is the right one. That is the structural point of the figure: they sit on a single frontier, and recency is paid for in sample size. Which to reach for depends entirely on the question — and the Archive’s actual job, the thing it exists to do, is to say out loud which trade a given number was struck on. A number quoted without its window is a number quoted without its meaning.

What today’s data can bear

Here is the part that is uncomfortable and also simply true. Even now, with only twenty-nine factions and the plainest possible metric, the data supports coarse claims and not fine ones.

It can carry tiers — this cluster of factions is strong, that cluster is weak — because a tier is a wide bin and the noise fits inside it. It can carry large, sustained movement: a faction that climbs ten points over a month, across several independent trackers at once, is telling you something real. And it can carry cross-source agreement, which the Archive treats as its firmest signal — when four trackers built on different windows all say the same thing, the thing is probably true.

What it cannot carry is precision. “Faction A is at 53.2% and faction B at 51.8%, so A is the stronger pick” is a sentence the data does not support, and three of the Archive’s methodology pieces explain why from three directions. The Big Soup Problem: pooling games from many events into one rate reports a tidiness the event-to-event behaviour never had. Six Bins: the binomial noise floor in a five-round event is wide enough that one player’s record says very little about their faction. Swiss Isn’t Random: tournament games are not independent draws — the pairing system correlates them, which inflates the true variance above what the standard formula reports. None of that is opinion; it is the arithmetic of the format.
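The 53.2% versus 51.8% comparison can be checked with an ordinary two-proportion z-test. This is a sketch under stated assumptions: 1,700 games per faction is a hypothetical sample size chosen for illustration, and the test treats games as independent, which Swiss pairing violates, so the real uncertainty is wider than this suggests. The function name is my own.

```python
import math

def two_rate_z(rate_a: float, rate_b: float, games_a: int, games_b: int) -> float:
    """z-score for the gap between two observed win rates (pooled standard error)."""
    wins = rate_a * games_a + rate_b * games_b
    pooled = wins / (games_a + games_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    return (rate_a - rate_b) / std_error

# Hypothetical sample: 1,700 games behind each faction's rate.
z = two_rate_z(0.532, 0.518, 1700, 1700)
print(f"z = {z:.2f}; a conventional 95% test needs |z| of at least 1.96")
```

Even on that generous sample the gap sits well under one standard error's worth of evidence, and the independence assumption flatters it further.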

So the honest resolution of today’s data is: tiers yes, trends yes, decimal places no.

Faction, detachment, disposition

The sections above spoke of factions, because the faction is where most public 40k statistics still sit. But the meta is sliced finer than that, and 11th edition is about to slice it finer again.

Detachment-level data is not new. Some trackers already publish results split by detachment — Warpfriends, among others, breaks its win rates down below the faction line. The Archive itself reports at faction level for now; that is a choice about where to spend the budget, not a frontier of what is possible. What 11th edition genuinely adds is a third axis: disposition — the scoring mechanic each army brings to the table. Outcomes will be read not only faction against faction, but detachment against detachment and disposition against disposition.

Old variable or new, the structural effect is identical. Each one layered on divides the same finite budget again.

Figure: two ladders side by side. The left ladder, 'win rate of', shows games-per-bucket falling from about 1,700 at the faction level (error bar near plus-or-minus 2 points) to 500 at faction plus detachment (plus-or-minus 4) to 100 at faction plus detachment plus disposition (plus-or-minus 10). The right ladder, 'matchup of', shows the same descent from 115 games per faction-versus-faction matchup down to about 10 games per faction-detachment cell and under 1 game per faction-detachment-disposition cell, the last marked as no estimate possible. A horizontal dashed line marks an 80-game small-sample threshold.

Caption: Each variable layered on divides the same finite volume again. Win rates stay above the small-sample line but their error bars balloon; the faction-detachment-disposition matchup grid has more cells than the month has games.

Follow the win-rate ladder on the left of the figure. A faction win rate, in an illustrative month of around fifty thousand games, rests on something like 1,700 games — and a 95% error bar near ±2 points. Split that metric to faction-and-detachment and each bucket holds about 500 games; the error bar opens to roughly ±4. Split it again to faction-detachment-disposition and you are at about 100 games a bucket and an error bar near ±10. Note what has and has not happened: the win-rate ladder does not collapse — 100 games still clears the small-sample line — but the confidence drains out of it at every step. A win rate quoted to ±2 points can anchor a real claim. The same metric at ±10 is barely a tier label wearing a decimal.
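The error bars on that ladder follow from the usual normal approximation to a binomial proportion. A minimal sketch, assuming independent games and a true rate near 50%; `margin_points` is an illustrative name, and the bucket sizes are the ones quoted above.

```python
import math

Z_95 = 1.96  # normal-approximation multiplier for a 95% interval

def margin_points(games: int, rate: float = 0.5) -> float:
    """Half-width of a 95% interval on a win rate, in percentage points."""
    return 100 * Z_95 * math.sqrt(rate * (1 - rate) / games)

# Games per bucket at each rung of the win-rate ladder.
ladder = {
    "faction": 1700,
    "faction + detachment": 500,
    "faction + detachment + disposition": 100,
}

for level, games in ladder.items():
    print(f"{level:<36} {games:>5} games  +/-{margin_points(games):.1f} points")
```

Under these assumptions the three rungs come out at about 2.4, 4.4 and 9.8 points, which round to the ±2, ±4 and ±10 quoted in the text. Because the margin shrinks with the square root of the sample, halving an error bar costs four times the games.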

The matchup ladder, on the right, is the harder story. A faction-versus-faction matchup already runs thin — a few hundred distinct matchups, around 115 games each on average, an error bar near ±9 before you have done anything exotic. Take it to faction-detachment matchups and the grid is some 5,000 cells holding about 10 games apiece — not an estimate, a rumour. Take it to faction-detachment-disposition matchups and the arithmetic turns absurd in a way worth stating plainly: that grid runs to something like 125,000 cells, and the month had around 50,000 games. There are more cells than games. The matrix is not thin; it is, structurally, mostly empty — most of its cells will hold zero games, and the rest one or two.
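The cell counts are a matter of counting pairs. The sketch below assumes matchups are unordered and include mirror matches. The 29 factions come from the text; the 100 faction-detachment and 500 faction-detachment-disposition combinations are round numbers assumed here because they reproduce the roughly 5,000 and 125,000 cells quoted above. The share of empty cells uses a Poisson approximation that spreads games evenly at random across cells, which real play does not do, so it likely understates how empty the grid is.

```python
import math

MONTH_GAMES = 50_000  # the illustrative month used in the text

def matchup_cells(options: int) -> int:
    """Unordered pairings between `options` choices, mirror matches included."""
    return options * (options + 1) // 2

# Assumed option counts at each level (see the note above on where they come from).
levels = [("faction", 29), ("+ detachment", 100), ("+ disposition", 500)]

for label, options in levels:
    cells = matchup_cells(options)
    per_cell = MONTH_GAMES / cells
    empty_share = math.exp(-per_cell)  # Poisson chance that a cell holds zero games
    print(f"{label:<14} {cells:>7} cells  {per_cell:>6.1f} games/cell  {empty_share:.0%} empty")
```

The number of cells grows with the square of the number of options while the games stay fixed, which is why the third rung falls below one game per cell.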

That is the whole mechanism, and it is just division: more variables do not enrich the data, they spread the same fixed information thinner and widen the error bar on every number that comes out.

A win rate is a comparison

There is a second hazard stacked on top of the first, and it is the one most readers never see.

A win rate, on its own, tells you nothing. “52%” is not a fact about a faction. It becomes a fact only when you set it against something — against an even 50%, against the field average, against another faction’s rate. Every reading of a win rate is a comparison; the number is meaningless in isolation. That is not a quirk of 40k data — it is what a rate is.

So a faction table is never one measurement. It is dozens of comparisons run at once — every faction implicitly weighed against every other. A detachment table is a hundred. A detachment matchup grid is thousands.

And that is where the multiple-comparisons problem bites — harder the more cells you have. Run a handful of comparisons and the occasional false positive is a risk you carry. Run thousands and it stops being a risk and becomes a certainty: among five thousand matchup cells, a fistful will show what looks like a real edge and is nothing but a run of dice landing one way. Not might — will. The arithmetic of running many tests guarantees a crop of false positives with no real effect under them at all; that is structural, the same maths whatever the subject.
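The size of that crop is easy to estimate. A minimal sketch, assuming every cell is tested at a conventional 5% significance level, that the tests are independent, and that no cell has any real edge; the threshold and the function names are assumptions for illustration.

```python
def expected_false_positives(cells: int, alpha: float = 0.05) -> float:
    """Cells expected to look significant when no cell has a real edge."""
    return cells * alpha

def chance_of_any(cells: int, alpha: float = 0.05) -> float:
    """Probability that at least one cell is a false positive (independent tests)."""
    return 1 - (1 - alpha) ** cells

for cells in (10, 100, 5000):
    print(
        f"{cells:>5} cells: expect {expected_false_positives(cells):.1f} phantom edges, "
        f"P(at least one) = {chance_of_any(cells):.3f}"
    )
```

At five thousand cells the expected count under these assumptions is 250, and the chance of seeing none is effectively zero. A Bonferroni-style correction would divide the threshold by the number of cells, but with ten games in a cell almost nothing could clear it.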

The mirror error is quieter and just as real. A false negative: a genuinely strong option whose one thin sample happened to land near 50% reads as unremarkable, and is passed over — because “the numbers don’t support it.” The thinner the cell, the easier it is for a real edge to hide inside the noise and for a phantom edge to shine out of it.
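How easily a real edge hides can be put in numbers with the exact binomial distribution. This sketch assumes a hypothetical option with a true 55% win rate, independent games and no draws, and asks how often its record comes out at 50% or worse; the function name and sample sizes are illustrative.

```python
from math import comb

def chance_at_or_below_even(games: int, true_rate: float) -> float:
    """Probability that a record of `games` games shows at most half wins."""
    most_wins = games // 2
    return sum(
        comb(games, wins) * true_rate**wins * (1 - true_rate) ** (games - wins)
        for wins in range(most_wins + 1)
    )

# Hypothetical option with a real 55% rate, seen through cells of different sizes.
for games in (10, 100, 400):
    p = chance_at_or_below_even(games, 0.55)
    print(f"{games:>4} games: {p:.0%} chance the record reads 50% or worse")
```

In a ten-game cell the genuinely strong option looks unremarkable about as often as a coin lands heads. The chance only becomes small once the cell holds several hundred games.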

This is the true danger of the dimensionality explosion, the part beyond thin samples. A matchup grid with thousands of populated cells is thousands of chances to be fooled, in both directions at once — and the fuller and more authoritative the grid looks, the larger the share of it that is illusion.

Three confounds that do not go away

Sample size is the problem you can see. There are three you cannot, and none of them is fixed by collecting more games.

The first, and the least discussed, is what the games actually are. Every win rate carries a silent assumption: that the games inside it are attempts to win, played with lists chosen to win. That assumption is convenient, and it is often wrong.

A whole-set tracker like 40kstats, built on the TableTop Battles record, takes in games from an enormous spread of events, and “tournament” is a wide word. Some of those games are top tables at majors, played by people optimising as hard as they can. Some are the back tables of a small local event that is, honestly, an afternoon of garage Warhammer with a scoreboard bolted on. Both pour into the same pool and the same win rate, and the rate cannot tell you how much of it was which.

It is not only the whole set. Even the Archive’s own pull (events of five-plus rounds, ostensibly the competitive end) is not uniform. At a local GT some players came to win the event; others came for a weekend with the hobby, and arrived with real concessions already baked into their list. They brought the models they own, or the ones they had finished painting, or the list that keeps their army on-theme, not the list a spreadsheet would have handed them. Those concessions are invisible in the result: a faction that lost because its pilot fielded the painted half of their codex is recorded identically to a faction that lost on its merits.

So a faction’s win rate is not “how strong that faction is, played to win.” It is “how that faction did, played by whoever played it, with whatever list their wallet, their paint table and their taste allowed.” Those are different questions, and the data answers the second while we ask it the first. The gap between them is not small and it is not random, and no quantity of extra games closes it — more games of the same mixed kind only measure the mixture more precisely. It is the deepest reason the data must be read coarsely: not only is each number imprecise, the thing the number is precise about is not quite the thing you asked.

The second is overlap. A faction’s win rate is not a clean reading of that faction’s strength, because players are not assigned factions at random. Stronger players gravitate to stronger factions, and a faction’s rate is lifted by the skill of the people who chose it. This is not a guess: it is exactly what the Archive’s Elo system was built to measure, and the gap between a faction’s raw rating and its skill-adjusted rating is the size of the effect, in numbers. Detachment data makes it worse, not better: the strongest players find the strongest detachment first, so a detachment’s rate is confounded twice over. Win rate measures faction-and-player jointly; pulling the two apart needs player-level modelling, not a larger spreadsheet.

The third is reflexivity. A competitive scene that can read last week’s results will adapt to them — that is simply what “competitive” means. The faction that posted the best week becomes the faction everyone arrives prepared for; its next week is shaped by its last. So a win-rate table never describes a settled system; it describes a system already in motion away from the state it reports.

And here is the turn that matters for what follows: the “results” the scene reads are, more and more, the published statistics themselves — the rankings and rates that sites like this one put out. Players do not react only to what happened at the weekend. They react to what the trackers said happened. A stats site is not a neutral instrument pointed at the meta. It is one of the inputs to it.

Valuable, or entertainment?

So — the question the whole piece has been walking toward. Here I am giving you an editorial position, not a theorem.

40k statistics are valuable. But they are valuable at a specific, coarse resolution, and most of the disappointment people carry about them comes from reading them at a finer grain than they can bear. A tier list drawn from the data is honest work. “These five factions are over-represented at the top” is honest work. A claimed 1.4-point edge between two factions, quoted to the decimal and treated as a fact you could plan around — that is not analysis. It is entertainment, and not the harmless kind, because it is wearing analysis’s clothes.

And because a stats site is an input to the meta and not only an observer of it, its errors do not stay errors. Publish a false positive — a faction or a detachment that looks strong and is not — and players read it and move there; the spurious number becomes a real migration of armies across real tables. Publish a false negative — a genuinely strong option buried because its thin sample dipped — and players stay away from it “because the numbers don’t support it”; it stays under-played, its sample stays thin, and the false negative hardens into received wisdom. A wrong number on a stats site is not a private mistake. It bends the meta around itself, and the bent meta then reads back as confirmation.

That is why the resolution question is not pedantry. Getting it right is the difference between informing the meta and misdirecting it — and 11th edition raises that responsibility, it does not lower it: more cells, more comparisons, more false positives queued up to be published, and a matchup grid that will look more authoritative the emptier it actually is.

So: valuable, yes — but the value and the danger are the same instrument. Read at the grain it can bear, the data informs. Read finer than that, and published as though it were solid, it does not merely mislead one reader — it moves the board, and then hands you the moved board as proof it was right.

Data is infinite, understanding it is Divine — and understanding, most days, is knowing which digit to stop reading at.

— The Editor