When Architecture Becomes Infrastructure

Frontier models no longer agree on how to handle attention, yet they are converging on the same design goals. The decisive gains, meanwhile, increasingly come not from network design but from how models are trained.

For several years, the same question kept resurfacing in AI. Which new architecture would finally unseat the Transformer? Between 2023 and 2024, a string of models promising linear-time scaling, among them Mamba, RWKV, and RetNet, arrived one after another, each greeted as a possible end of the Transformer era. By 2026 the question has largely gone quiet — but not because a successor won, and not quite because everyone settled on a single design. Something subtler happened: architecture stopped being where the contest is decided. It has become infrastructure — necessary, expensive, actively engineered, and yet no longer the main source of the gap between models. That gap now lives in training: in data, in reinforcement-learning pipelines, and in the tools a model can call. The deeper pattern, which the second half of this essay turns to, is about which kinds of elegance survive contact with massive compute, and which get washed out.

Converging on goals, diverging on mechanisms

Two long-standing pressures shape everything. The first is the cost of attention over long contexts. Standard self-attention needs compute that grows quadratically with sequence length, and its key-value (KV) cache grows linearly with it; once a context stretches to hundreds of thousands or millions of tokens, the KV cache alone can exhaust the memory of a large GPU deployment. The second is the cost and energy of dense computation. In a dense model every token activates the full set of parameters, so a model's growth runs up against both bandwidth and power. Nearly everything interesting in frontier architecture over the past two years is a response to these two pressures.
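To make the first pressure concrete, here is a back-of-the-envelope sketch of how the KV cache and the attention compute grow with context length. The model shape in `AttnConfig` is hypothetical and chosen only for illustration; it does not describe any model named in this post, and the FLOP count ignores causal masking and everything outside the attention matmuls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical full-attention model shape (illustrative values only)."""
    n_layers: int = 60
    n_q_heads: int = 64
    n_kv_heads: int = 8       # grouped-query attention: fewer KV heads than query heads
    head_dim: int = 128
    bytes_per_value: int = 2  # 16-bit cache entries


def kv_cache_bytes(cfg: AttnConfig, seq_len: int) -> int:
    """KV cache for one sequence: keys and values, every layer, every position."""
    per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_value
    return per_token * seq_len


def attention_flops(cfg: AttnConfig, seq_len: int) -> int:
    """Rough FLOPs for QK^T plus the weighted sum over V during prefill.

    Each of the two matmuls costs about 2 * seq_len^2 * head_dim per query head.
    """
    return 4 * cfg.n_layers * cfg.n_q_heads * cfg.head_dim * seq_len * seq_len


if __name__ == "__main__":
    cfg = AttnConfig()
    for n in (8_000, 128_000, 1_000_000):
        gb = kv_cache_bytes(cfg, n) / 1e9
        print(f"{n:>9} tokens: KV cache {gb:8.2f} GB, attention FLOPs {attention_flops(cfg, n):.2e}")
```

Under these assumed numbers the cache costs 245,760 bytes per token, so a single million-token sequence needs about 246 GB of cache before any weights or activations are counted. Doubling the context doubles the cache and quadruples the attention FLOPs.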

What is striking is that the responses have not converged on a single mechanism. On the long-context problem alone, the flagship releases of early 2026 split into at least three camps:

| Model (early 2026) | Long-context approach | Linear : full ratio |
| --- | --- | --- |
| Qwen3.5 | Gated DeltaNet linear layers + full attention | 3 : 1 |
| Ling / Ring 2.5 (1T) | Lightning linear attention + MLA | 7 : 1 |
| Hunyuan-TurboS | Mamba2 layers + GQA attention | ≈ 8 : 1 |
| GLM-5 | MLA + DeepSeek-style sparse attention (DSA) | n/a |
| DeepSeek V4 | Compressed sparse attention (CSA + HCA) | n/a |
| Kimi K2.5 | MLA, full attention throughout | n/a |
| MiniMax M2.5 | Plain full multi-head attention | n/a |

One camp bets on linear hybrids: a backbone of linear-complexity sequence layers — state-space models in the Mamba family, or variants of gated linear attention — interleaved with a small number of full-attention layers that handle precise retrieval and harder reasoning. Alibaba's Qwen3.5 alternates Gated DeltaNet and full attention at roughly three to one; Ant Group's Ling 2.5 rebuilds a trillion-parameter model around a seven-to-one mix of Lightning linear attention and the Multi-head Latent Attention (MLA) it borrowed from DeepSeek; Tencent's Hunyuan-TurboS runs a Mamba2–Transformer hybrid in production at 560 billion total parameters. Notably, even within this camp the "right" ratio of linear to full layers has not settled — 3:1 in the Qwen3-Next lineage and Kimi Linear, 7:1 at Ling, roughly 8:1 at Tencent — and seems to depend on the expressiveness of the linear mechanism and on what you evaluate. A second camp keeps attention but makes it sparse or compressed: GLM-5 layers DeepSeek-style sparse attention on top of MLA, and DeepSeek's own V4 pairs two compressed-attention mechanisms to cut the FLOPs and KV cache of million-token inference to a fraction of its predecessor's. And a third camp simply pays for full attention — Kimi K2.5 with MLA, MiniMax M2.5 with plain multi-head attention — on the view that reliability in long reasoning is worth the memory bill. One survey of the February releases was titled, aptly, "Nobody Agrees on Attention Anymore."
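The ratios quoted for the linear-hybrid camp can be read as a layer schedule. The sketch below is a minimal illustration, assuming the full-attention layer simply closes each repeating group; the labs' actual layer placement is not described here, and `hybrid_schedule` and `growing_cache_fraction` are hypothetical helper names.

```python
def hybrid_schedule(n_layers: int, linear_per_full: int) -> list[str]:
    """Layer types for a hybrid stack: `linear_per_full` linear layers, then one full layer.

    Assumption: the full-attention layer sits at the end of each group.
    """
    period = linear_per_full + 1
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(n_layers)]


def growing_cache_fraction(schedule: list[str]) -> float:
    """Share of layers whose KV cache still grows with context length.

    Linear layers carry a fixed-size state, so only full-attention layers count.
    """
    return schedule.count("full") / len(schedule)


if __name__ == "__main__":
    for ratio in (3, 7):
        sched = hybrid_schedule(48, ratio)
        print(f"{ratio}:1 over 48 layers -> {sched.count('full')} full layers, "
              f"{growing_cache_fraction(sched):.3f} of layers keep a growing cache")
```

In this toy 48-layer stack, moving from 3:1 to 7:1 halves the number of layers that pay the long-context memory bill, from 12 to 6. That is the efficiency side of the trade; the retrieval and reasoning side is what the choice of ratio has to protect.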

The most instructive episode here is a reversal. MiniMax's M1 was among the most aggressive linear designs, interleaving seven Lightning Attention blocks for every softmax block. Its successor, M2, went back to full attention, and the lab's pretraining lead published an unusually candid postmortem explaining why. Two reasons stand out. First, linear attention's regressions surface precisely in the regimes that now matter most, long-chain reasoning and multi-step agentic work, and current evaluation suites are poor at catching them. Second, although the theoretical compute crossover with full attention sits at only a few thousand tokens, actually reaching it requires low-precision state storage, prefix caching, and a stack of infrastructure that linear attention does not yet have. The team remains bullish in the long run (once context length grows faster than GPU capacity, the linear and sparse payoff gets unlocked), but for now the costs outweighed the benefit. The point of the episode is not that linear attention "lost"; Ant shipped a trillion-parameter linear hybrid in the same quarter. The point is that the choice among linear, sparse, and full attention has become a cost-and-infrastructure calculation that different labs, facing different workloads and serving economics, can rationally resolve in different directions. That is what a technology looks like once it is infrastructure rather than a moat.

Where genuine mechanical convergence does exist is in sparsity. Almost every flagship model is now a mixture-of-experts (MoE), and the fraction of parameters activated per token keeps falling. DeepSeek-V3 activates only about 5.5% of its parameters on each forward pass; Meta's Llama 4 Maverick, about 4.3%; Kimi K2.5, about 3.2%. DeepSeek V4-Pro activates roughly 49 billion of its 1.6 trillion parameters at a time, and Alibaba's Qwen3-Coder-Next, with 80 billion parameters in all, activates just 3 billion, yet outperforms models several times its active size on coding tasks. Mechanisms such as Mixture-of-Depths, which routes easy tokens through fewer layers, loosen the link between compute and parameter count even further.
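The activation fractions above are simple ratios, and the sparsity behind them comes from a router that sends each token to only a few experts. The sketch below checks two of the figures quoted in this paragraph and shows a toy top-k selection. The router scores are made up, and real routers add load balancing, shared experts, and other details this ignores.

```python
def active_fraction(active_params: float, total_params: float) -> float:
    """Fraction of a model's parameters used for a single token."""
    return active_params / total_params


def top_k_experts(router_scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring experts for one token (toy router)."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Parameter counts as quoted in the text above.
    print(f"DeepSeek V4-Pro:  {active_fraction(49e9, 1.6e12):.2%}")
    print(f"Qwen3-Coder-Next: {active_fraction(3e9, 80e9):.2%}")

    # Hypothetical router scores for one token over four experts.
    print(top_k_experts([0.10, 0.70, 0.05, 0.15], k=2))
```

Only the selected experts run for that token, which is how total parameter count and per-token compute come apart.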

Newer components are still being folded into the frame. One concerns how models reason. By default, today's models behave like fast, intuitive responders, and the "slow thinking" that emerged over the past year relies mainly on chains of thought, writing each reasoning step out as text. A newer idea, implicit reasoning, lets the model iterate in latent space instead of emitting its intermediate steps as text, scaling the compute spent on reasoning without lengthening the output. The early evidence spans several fronts: Meta's COCONUT trains models to reason in a continuous latent space rather than in tokens; recurrent-depth models scale test-time compute by looping layers; and in robotic control, iterating a single action module a few times can sharply raise success rates. Low-precision training, too, is no longer confined to deployment. NVIDIA has reported pretraining a 12-billion-parameter model on ten trillion tokens in the 4-bit floating-point format NVFP4, reaching loss and downstream accuracy close to an FP8 baseline — the largest 4-bit training run disclosed so far.

So the convergence is real, but it lives one level above where it is usually described. What the frontier shares is not a mechanism but a set of design goals — sub-quadratic cost over long contexts, sparse activation, ever-lower numerical precision, more compute folded silently into the forward pass — plus the engineering discipline to hit them. Below that level the mechanisms still differ, and labs move between them as the economics shift. Convergence at the level of goals, fluidity at the level of mechanisms: that is roughly what it means for architecture to become infrastructure.

Where the gains are coming from

If architecture were still the differentiator, the mechanism splits above should show up as capability gaps. Mostly, they don't: linear hybrids, sparse-attention models, and full-attention models sit interleaved on the leaderboards that matter. Consider instead the most striking jumps in capability over the past year — OpenAI's o-series and the GPT-5 models that followed, DeepSeek's R1, and agents such as Kimi K2.6 that lead on agentic coding and tool-use benchmarks and can carry out multi-step tasks on their own. Most of these owe little to new network structure. They come from a set of training methods: reinforcement learning against verifiable rewards, more compute spent at inference time, calls to external tools and code execution, and the mixing and cleaning of training data.

The point is clearest in the gap between open and closed models. When DeepSeek released V4-Pro in April 2026 — 1.6 trillion total parameters, 49 billion active — its reported 80.6 on SWE-bench Verified sat 0.2 points behind Claude Opus 4.6, and its distance from the GPT-5 series and Gemini 3.1 Pro was similarly small. (These are vendor-reported numbers on a leaderboard that turns over monthly; the specific models will be stale by the time you read this. The shape of the gap is the point.) What closed that gap is data and the RL pipeline, not a change in the underlying network. One honest caveat belongs here: the closed labs do not publish their architectures, so the claim that training rather than architecture closes the open–closed gap rests partly on the assumption that the closed models are architecturally unexotic. That assumption is widely shared, and consistent with what serving costs and latencies suggest, but it is an assumption rather than an observation.

The return of symbolic reasoning is a telling example of the same shift. For years, a revival of "neuro-symbolic" methods was seen as a likely route to stronger, more rigorous reasoning, and the usual vision was to build a dedicated symbolic module into the network. That capability is indeed improving, but not in the way it was imagined. A model gains reliable reasoning and self-checking by calling a code interpreter or an external verifier and then doing reinforcement learning against what the verifier returns. The symbolic part, in other words, enters through a tool interface rather than through the architecture.
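A minimal sketch of what "reinforcement learning against what the verifier returns" can mean in practice: the reward is computed by running the model's output, not by a learned judge. Everything here is hypothetical, including the function name `verifiable_reward` and the binary pass/fail scoring; a real pipeline would also sandbox execution and enforce timeouts.

```python
from typing import Any


def verifiable_reward(candidate_src: str, func_name: str,
                      cases: list[tuple[tuple, Any]]) -> float:
    """Return 1.0 if the candidate code passes every test case, else 0.0.

    NOTE: exec() on untrusted model output is unsafe; this toy version has
    no sandbox and no timeout, both of which a real system would need.
    """
    namespace: dict[str, Any] = {}
    try:
        exec(candidate_src, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime failures all score zero.
        return 0.0
    return 1.0 if passed else 0.0


if __name__ == "__main__":
    cases = [((1, 2), 3), ((0, 0), 0), ((-4, 4), 0)]
    good = "def add(a, b):\n    return a + b\n"
    bad = "def add(a, b):\n    return a - b\n"
    print(verifiable_reward(good, "add", cases))
    print(verifiable_reward(bad, "add", cases))
```

The symbolic component is the interpreter and the test cases. Nothing in the network has to change for the policy to be trained toward outputs that pass.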

A cleaner way to put the division of labor: architecture now buys efficiency, and training buys capability. The Ling 2.5 numbers are throughput numbers — Ant reports more than triple the generation throughput of its predecessor beyond 32K tokens, and a widening advantage over the Kimi K2 architecture as outputs grow longer. The V4-Pro numbers that close the gap with the closed models are RL-pipeline numbers. The two are coupled, of course: whether a lab can afford reinforcement learning over million-token agentic trajectories depends directly on what attention costs it, which is much of why the efficiency war is worth fighting at all. But the coupling runs through the budget, not through the benchmark. Architecture in 2026 is necessary the way power and cooling are necessary. It is no longer where the contest is decided.

Where elegance and compute pull apart

A related pattern is worth dwelling on. Many mathematically elegant methods see their theoretical advantage diluted, in practice, by the sheer scale of compute and resources. This is not a new observation. In "The Bitter Lesson," Richard Sutton argued that over the long run, general methods that make better use of computation tend to beat methods built on human insight and clever structure. Sara Hooker's "The Hardware Lottery" pushed the point further, arguing that whether a research idea succeeds depends largely on whether the hardware of the day happens to suit it, rather than on the idea's intrinsic merit.

Still, "elegance always loses to brute force" would be too crude a summary. What gets diluted is mainly the kind of elegance that hard-codes human knowledge, priors, and structure into the model. Given enough data and compute, the model tends to learn those regularities on its own, and the hand-designed constraints can turn into a liability. The other kind of elegance, the simple and regular sort that is well matched to parallel matrix computation, is amplified by scale rather than worn down. The Transformer is itself an instance of the latter. It won precisely because its structure fits what GPUs do well. A more precise way to put it is that elegance aligned with compute gets amplified, while elegance at odds with compute gets washed out.

Several of the developments already mentioned bear this out. State-space models are theoretically elegant; their continuous-time formulation and the HiPPO framework give a principled account of long-range memory. Yet by Mamba-2, the designers deliberately gave up some of that expressiveness to train faster on GPUs, trading away part of the theoretical refinement for raw speed. MiniMax's retreat to full attention is the same pattern seen from the other side: the elegant mechanism was beaten not on theory but on kernels, caches, and numerics. Low-precision training tells the story again. The cleanest version of the idea is 1-bit BitNet, which remains a research curiosity. What actually trains stably at frontier scale is 4-bit NVFP4, which leans on a set of distinctly inelegant engineering tricks, among them random Hadamard transforms, stochastic rounding, and keeping a few layers in high precision. The clean idea survives, but in a diluted, heavily patched, engineered form.
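Of the tricks just listed, stochastic rounding is the easiest to show in a few lines. The sketch below rounds to a uniform grid with spacing `step`, which is a simplifying assumption: it is not the NVFP4 format and not NVIDIA's recipe, only an illustration of why randomized rounding keeps small values from vanishing on average.

```python
import math
import random


def nearest_round(x: float, step: float) -> float:
    """Deterministic round-to-nearest on a uniform grid of spacing `step`."""
    return round(x / step) * step


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a neighbouring grid point, rounding up with probability
    equal to the fractional distance above the lower grid point."""
    scaled = x / step
    low = math.floor(scaled)
    frac = scaled - low
    return (low + (1 if rng.random() < frac else 0)) * step


if __name__ == "__main__":
    rng = random.Random(0)
    x, step, n = 0.3, 1.0, 20_000
    mean = sum(stochastic_round(x, step, rng) for _ in range(n)) / n
    print("round-to-nearest:", nearest_round(x, step))
    print("stochastic mean: ", round(mean, 3))
```

Round-to-nearest maps every 0.3 to 0, so a stream of small updates is silently lost. Stochastic rounding returns 0 or 1 on each call but is unbiased in expectation, which is the property that matters when many small gradient contributions are accumulated on a coarse grid.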

None of this means theory has lost its value; rather, its role has shifted. In many cases the elegant mechanism itself is replaced, but its value as a conceptual frame survives. State-space theory was simplified, yet it seeded the entire line of hybrid architectures. And some methods may hold a real theoretical edge that only appears at scales we cannot yet afford, closer to "too early" than to "wrong." On that reading, dilution looks less like an iron law than a feature of the current, compute-rich moment; once the easy gains from cheap compute are spent and the low-hanging improvements run out, theory may again become what separates the leaders.

What is still open

None of this means architecture is fully settled. Most of the directions that remain unresolved are refinements of the existing frame rather than challengers to it — with one possible exception at the end.

One cost of implicit reasoning is interpretability. A chain of thought written out as text is often called inefficient, yet its very visibility is valuable, since researchers can apply reinforcement learning to an observable reasoning process, check it step by step, and watch for anomalies. The latent iteration in implicit reasoning buries that process in high-dimensional representations that cannot be read off directly. Work on "decoding" the internal states of such models has already begun, framed as a safety question. For that reason, the more cautious expectation is that implicit reasoning will serve as a complement to explicit reasoning rather than a replacement for it.

Multimodality and world models are a second area still in motion, and a heavily funded one. DeepMind's Genie 3 generates interactive 3D worlds in real time from a prompt and keeps them visually consistent for minutes at a stretch; Meta's V-JEPA 2, trained on roughly a million hours of video plus tens of hours of robot interaction, reaches high zero-shot success at manipulation in unfamiliar environments; and Yann LeCun left Meta in early 2026 to found AMI Labs, a lab devoted to general-purpose world models. Even so, it would be a mistake to read this as text being sidelined while world models take over. Vision-only world models hit a wall of their own. From video alone they struggle to distinguish actions that look alike but differ in intent (pretending to twist something versus actually twisting it) and need other modalities or physical reasoning to resolve the ambiguity. V-JEPA 2's strongest visual reasoning, in fact, appears when it is paired with a language model. On the evidence so far, this looks more like a fusion of modalities than one displacing another. Relatedly, the input side is trying to shed the fixed tokenizer, with some work moving to byte-level, end-to-end processing, though that remains largely a research direction for now.

Generation itself has alternatives to autoregression. Diffusion language models drop the token-by-token, sequential decoding and instead recover a whole passage from noise in parallel, sidestepping the latency and the KV-cache bottleneck that autoregression carries with it. In early 2026, Inception released Mercury 2, the first diffusion model aimed at reasoning, with generation speeds of more than a thousand tokens per second, roughly ten times a comparable autoregressive model; Google's Gemini Diffusion and ByteDance's Seed Diffusion are pushing in the same direction. For now, though, pure diffusion models still trail autoregressive ones clearly on reasoning benchmarks, come close only on tasks like code generation, and grow expensive on long outputs. So they are likelier to arrive in practice first as hybrids, pairing block-wise diffusion with autoregression, than to replace it outright.
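The latency argument comes down to how many forward passes must run one after another. The sketch below counts sequential passes only, under stated assumptions: one pass per token for autoregression, and a fixed number of denoising passes per block for a block-wise diffusion hybrid. The block size and step count are hypothetical, and the comparison ignores prefill and the fact that a diffusion pass over a block costs more than a single-token decode step.

```python
import math


def autoregressive_passes(n_tokens: int) -> int:
    """Sequential forward passes to emit n_tokens one at a time."""
    return n_tokens


def block_diffusion_passes(n_tokens: int, block_size: int, denoise_steps: int) -> int:
    """Sequential passes when blocks are generated left to right and each block
    is denoised in parallel over `denoise_steps` refinement passes."""
    n_blocks = math.ceil(n_tokens / block_size)
    return n_blocks * denoise_steps


if __name__ == "__main__":
    n = 1024
    print("autoregressive: ", autoregressive_passes(n))
    print("block diffusion:", block_diffusion_passes(n, block_size=64, denoise_steps=8))
```

With these assumed settings, 1,024 tokens take 1,024 sequential passes autoregressively and 128 in the block-wise scheme. The gap narrows or reverses if a model needs many denoising steps per block to match quality, which is one way to read why long outputs get expensive.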

The candidate most likely to drag architecture back from infrastructure to battleground, though, is memory. Today's models are frozen after training and amnesiac between sessions; long contexts and retrieval are workarounds, not solutions. The architectural proposals — memory layers that read and write to large learned key-value stores, Titans-style test-time memorization, layers whose weights update during inference — aim at something training methods alone cannot easily bolt on: a model that accumulates experience. Persistent memory is the rare open problem that looks architectural all the way down. If the contest over network design reopens anywhere, it is likelier to be here than in another attention variant.

Taken together, the signs point to an architecture that has entered a stable phase — stable in its goals, if not in every mechanism — and is becoming shared infrastructure. What now separates one model from another is increasingly the training data, the reinforcement-learning pipeline, and the tools a model can reach, which also happen to be the parts each lab discloses least, and the high ground everyone is now fighting over.

Citation

If you need to cite this post, please use:

Bojian Yin. (Jun. 09, 2026). When Architecture Becomes Infrastructure [Blog post]. Retrieved from https://byin-cwi.github.io/MatrixWeb/posts/when-architecture-becomes-infrastructure.html

@online{matrixweb-when-architecture-becomes-infrastructure,
        title={When Architecture Becomes Infrastructure},
        author={Bojian Yin},
        year={2026},
        month={Jun},
        url={https://byin-cwi.github.io/MatrixWeb/posts/when-architecture-becomes-infrastructure.html},
}