Background
When people talk about 4th and 5th generation models, it's worth distinguishing "agents" — the actual systems we build by wrapping a model in tools, memory, and a goal-seeking loop — from the "agentic nature" of the models themselves, which is the model's own built-in capacity to plan, choose tools, and self-correct. The newer generations are interesting precisely because that agentic ability is increasingly baked into the model rather than scaffolded by us, so a much simpler agent can now do far more on its own.
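To make the "tools, memory, and a goal-seeking loop" wrapper concrete, here is a minimal sketch of the scaffolding an agent puts around a model. Everything in it is hypothetical: `stub_model` stands in for a real model call, `add` is a toy tool, and the dictionary format for decisions is an assumption made for illustration, not any lab's API.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a real model call: it looks at the goal and the
# memory so far and returns either a tool request or a final answer.
def stub_model(goal: str, memory: List[str]) -> Dict[str, str]:
    if not memory:
        return {"action": "tool", "tool": "add", "args": "2,3"}
    return {"action": "finish", "answer": f"{goal} -> {memory[-1]}"}

# A toy tool: parses "a,b" and returns the sum as text.
def add(args: str) -> str:
    a, b = (int(x) for x in args.split(","))
    return str(a + b)

TOOLS: Dict[str, Callable[[str], str]] = {"add": add}

def run_agent(goal: str, model=stub_model, max_steps: int = 5) -> str:
    memory: List[str] = []          # results of earlier steps
    for _ in range(max_steps):      # the goal-seeking loop
        decision = model(goal, memory)
        if decision["action"] == "finish":
            return decision["answer"]
        tool = TOOLS[decision["tool"]]
        memory.append(tool(decision["args"]))
    return "stopped: step limit reached"

print(run_agent("sum of 2 and 3"))  # prints: sum of 2 and 3 -> 5
```

The loop itself is deliberately dumb. The more planning and self-correction the model does on its own, the less logic this wrapper needs to carry.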
What's struck me most is how much the craft of prompting has changed between 2025 and 2026, when these 4th-generation models really landed. Back in 2025 we were still writing long, carefully engineered prompts (spelling out the role, the steps, the format, the edge cases) because the models needed that scaffolding to stay on track. Now much of that has fallen away: you increasingly just state the goal and the constraints and let the model handle its own planning, tool selection, and self-correction. The skill has shifted from "instructing" to "delegating": less about dictating every step and more about clearly framing intent, giving good context, and knowing when to check the model's reasoning rather than micromanage it. Ironically, the better the models get, the less you need to know about how they work, and the more it becomes about knowing how to talk with them.
If you've felt that AI models are improving faster than you can keep track of, you're not imagining it. In just a few years we've moved through what I think of as four distinct generations of large language models (the table below picks up at the second), and the change isn't just "they got smarter." It's a story about reasoning, context, multimodality, and increasingly, autonomy. Here's how the major labs stack up, and where the costs have landed.
A quick note on method: the "generation" labels below are my own pragmatic mapping, not official industry terms — each lab versions its own way. The earlier generations reflect real released models and pricing; the 4th-generation column is an informed projection, so treat those figures as directional rather than gospel. All pricing is per 1 million tokens, shown as input / output.
| Lab | 2nd Gen (~2022–23) | 3rd Gen (~2024–25) | 4th Gen (~2025–26, projected) |
|---|---|---|---|
| OpenAI | GPT-3.5 Turbo — chat & basic reasoning, text-only, short context. ~$0.50 / $1.50 | GPT-4o / GPT-4 Turbo — strong reasoning, native multimodal, 128K context. ~$2.50–5 / $10–15 | GPT-5-class — deep reasoning by default, strong agentic behaviour. ~$3–6 / $12–20 |
| Anthropic | Claude 1 / 2 — capable text, large context for its era, safe outputs. ~$8 / $24 | Claude 3 / 3.5 — top-tier reasoning & coding, vision, 200K context. Sonnet ~$3 / $15; Haiku ~$0.25 / $1.25 | Claude 4-class — leading agentic & coding capability, sustained autonomy. ~$3–15 / $15–75 by tier |
| Google | PaLM / PaLM 2 — general text, early multimodal, moderate reasoning. ~$0.50–1 / $1.50–2 | Gemini 1.5 / 2.0 — huge context (1–2M tokens), multimodal, cheap "Flash" tier. Pro ~$1.25–2.50 / $5–10; Flash ~$0.075 / $0.30 | Gemini 3-class — very large context, tight agent integration. ~$2–4 / $8–12 (Pro) |
| DeepSeek (China) | DeepSeek LLM / Coder — early open-weight text & code, modest context. ~$0.15–0.30 / $0.30–0.60 | DeepSeek V3 / R1 — near-frontier reasoning, open-weight, 128K context. V3 ~$0.27 / $1.10; R1 ~$0.55 / $2.19 | DeepSeek V4 / R2-class — improved reasoning & agentic behaviour, open-weight. ~$0.30–0.70 / $1.20–2.50 |
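
To make the "per 1 million tokens, input / output" convention concrete, here is a small worked example. The function name and the token counts are made up for illustration; the $3 / $15 prices are the Claude Sonnet figures from the table.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Dollar cost of one request; prices are USD per 1 million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical request: 10,000 tokens in, 2,000 tokens out, at $3 / $15.
cost = request_cost(10_000, 2_000, 3.00, 15.00)
print(f"${cost:.4f}")  # prints: $0.0600
```

Note that the 2,000 output tokens cost as much as the 10,000 input tokens here, which is why the output price usually deserves the closer look.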
Four things the table is really telling you
Capability compounds, generation over generation. Each step adds the same four ingredients in greater measure: deeper reasoning, richer multimodality (text, image, audio, video), longer context windows, and — most significantly in the latest generation — more native agentic behaviour. That last point is the one to watch. Earlier models needed you to script every step; newer ones can plan, choose their own tools, and self-correct with far less hand-holding.
The real price story is performance, not the sticker. Look only at flagship per-token prices and you'd think costs have barely moved. But the cost per unit of useful work has collapsed. Today's cheap tiers — Haiku, Flash, mini — do what last year's flagships did, and context windows have ballooned from a few thousand tokens to over a million. You're getting dramatically more capability per dollar, even when the headline number looks flat.
Every line has split into "frontier" and "fast." By the 3rd generation, each lab offered a top-tier model for hard problems (Opus, GPT-4o, Gemini Pro) and a fast, cheap workhorse for everyday volume (Haiku, mini, Flash). Choosing the right tier for the task is now one of the highest-leverage decisions you can make on cost.
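A rough sketch of why tier choice matters so much, using the Sonnet and Haiku prices from the table. The workload size, the token counts, and the 10% share of "hard" requests are all assumptions picked for illustration, and Sonnet stands in for the larger tier only because the table lists its price.

```python
# Prices in USD per 1M tokens (input, output), from the 3rd-gen column of the table.
PRICES = {
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def workload_cost(tier: str, requests: int, in_tokens: int, out_tokens: int) -> float:
    """Total dollar cost of sending `requests` identical requests to one tier."""
    in_price, out_price = PRICES[tier]
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return requests * per_request

# Hypothetical workload: 100,000 requests, each 2,000 tokens in and 500 out.
all_large = workload_cost("sonnet", 100_000, 2_000, 500)
all_fast = workload_cost("haiku", 100_000, 2_000, 500)

# Assumed split: only 10% of requests are hard enough to need the larger tier.
mixed = (workload_cost("sonnet", 10_000, 2_000, 500)
         + workload_cost("haiku", 90_000, 2_000, 500))

print(f"all large tier: ${all_large:.2f}")  # $1350.00
print(f"all fast tier:  ${all_fast:.2f}")   # $112.50
print(f"mixed routing:  ${mixed:.2f}")      # $236.25
```

Under these assumptions, routing only the hard tenth of the traffic to the larger model costs less than a fifth of sending everything there.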
DeepSeek changed the economics for everyone. The standout disruptor is DeepSeek, whose models delivered reasoning approaching the Western frontier at roughly a tenth to a twentieth of the price — and shipped as open weights you can self-host. That combination pressured the entire market on price-performance and made genuinely capable AI cheap to run. One practical caveat worth keeping in mind: as a Chinese-developed model, DeepSeek carries content-moderation and data-governance considerations on sensitive topics, though self-hosting the open weights mitigates much of this.
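As a quick sanity check on the "tenth to a twentieth" figure, the sketch below divides the 3rd-generation prices in the table by DeepSeek V3's. The helper name is hypothetical, and the comparison uses only the approximate list prices quoted above.

```python
# Per-1M-token prices (input, output) from the 3rd-gen column of the table.
deepseek_v3 = (0.27, 1.10)
rivals = {
    "GPT-4o / GPT-4 Turbo (low end)": (2.50, 10.00),
    "GPT-4o / GPT-4 Turbo (high end)": (5.00, 15.00),
    "Claude 3 / 3.5 Sonnet": (3.00, 15.00),
}

def price_ratios(rival, baseline):
    """How many times more expensive `rival` is than `baseline` (input, output)."""
    return tuple(round(r / b, 1) for r, b in zip(rival, baseline))

for name, price in rivals.items():
    in_ratio, out_ratio = price_ratios(price, deepseek_v3)
    print(f"{name}: {in_ratio}x input, {out_ratio}x output")
```

On these figures the gap runs from roughly 9x to roughly 18x, which is in line with the range quoted above.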
The bottom line
The headline of the last few years isn't simply that models got smarter: it's that capability got cheaper, longer-context, and increasingly autonomous. The practical skill is shifting accordingly: less about engineering elaborate prompts, more about clearly framing your intent and picking the right model for the job. The better these systems get, the less you need to understand how they work under the hood, and the more it pays to know how to talk with them.
For more information, see "Opus 4.8 Just Dropped. Here's How To Actually Use It" or the discussion by the four MIT guys on the future of AI, "Opus 4.8 Drops, Demis Hassabis Predicts AGI, and the $220B Foundation | EP #260".




