Where Humans and Machines Diverge — Lab
Where humans and machines diverge
Four short challenges. You try each one, then see how today's frontier AI performs on the same kind of task, and on which axes the gap is largest. The pattern across all four is the real subject of the companion post.
Before you start
Read this short panel first. It tells you what the lab is, what it is trying to make you see, and how you will know if you got there.
🎯 Purpose
Four small cognitive tasks. You attempt each one and pick an answer. For each task, the lab then reveals — with published research figures — how current frontier AI systems perform on the same kind of task. The tasks are chosen to span the five axes of difference from the companion article: data efficiency, embodiment, transfer, common sense, and energy (the fifth is described after the challenges, since it cannot be played).
💡 What it is trying to make you see
That "is AI smarter than humans?" is the wrong question. The right question is "on which axis?" Some tasks make humans look like geniuses next to machines; other tasks make machines look like geniuses next to humans. Both verdicts are honest — they are answering different questions. The companion article explains this in words; the lab makes you experience the asymmetry on yourself.
✅ What you should understand after playing
After working through the four challenges, you should leave able to:
- Name at least two tasks on which today's AI is reliably better than the average human, and two on which the average human is reliably better than today's AI.
- Explain, for one of those tasks, why the gap goes that way — what about the underlying mechanism makes one side win.
- Catch yourself wanting to ask "better on which task?" whenever someone says "AI is/isn't smarter than humans."
If those three are true for you when you leave, the lab did its job. If not, re-read the verdict notes inside each challenge and revisit the article's five axes.
How to use it — 30 seconds
- Read each challenge and click the answer you think is correct.
- Read the verdict. The lab reveals the correct answer, whether you got it right, and how current AI does on the same task, along with the source and the approximate published figure.
- Watch the scoreboard at the bottom. After all four, you will see your performance and AI's side-by-side across the four axes.
A note on the AI numbers
The "AI performance" numbers in each verdict are based on published benchmark results from 2023–2025 for the relevant frontier model class. They are best-effort summaries, not exact reproductions of any single run, and they are directional: the point is which side wins each task, not the precise percentage. Sources are cited in the verdict notes.
The four challenges
Pick the best answer for each. You can change your mind before clicking Reveal. After revealing, the next challenge unlocks.
⬛⬜⬛ → ⬜⬛⬜
⬛⬛⬜ → ⬜⬜⬛
⬜⬛⬛ → ⬛⬜⬜
⬛⬜⬜ → ?
Verdict
Correct answer: A (⬜⬛⬛). The rule inverts each cell. Most humans see this rule after one or two examples; on the broader ARC-AGI-1 benchmark, the published human baseline sits around 85%.
→ Humans still ahead on the harder benchmark. Frontier AI closed most of the ARC-AGI-1 gap with very high compute, but ARC-AGI-2 re-opens the gap by design.
Source: ARC Prize 2024 / 2025 (arcprize.org/blog/oai-o3-pub-breakthrough · arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025).
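To see how little machinery the rule needs once it is stated, here is a minimal Python sketch (illustrative only, not the lab's actual code) that applies the per-cell inversion to the rows above:

```python
# Illustrative sketch: the puzzle's rule is a per-cell inversion.
# Not the lab's scoring code; just the rule made explicit.

INVERT = {"⬛": "⬜", "⬜": "⬛"}

def apply_rule(row: str) -> str:
    """Invert every cell in a row of ⬛/⬜ symbols."""
    return "".join(INVERT[cell] for cell in row)

for row in ["⬛⬜⬛", "⬛⬛⬜", "⬜⬛⬛"]:
    print(row, "->", apply_rule(row))

print("⬛⬜⬜", "->", apply_rule("⬛⬜⬜"))  # ⬜⬛⬛, answer A
```

Three examples are enough for most humans to induce this mapping; the verdict's point is how much data or compute a machine needs to find an unseen rule of this kind.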
47 × 89 = ?
Verdict
Correct answer: B (4,183). 47 × 89 = (47 × 90) − 47 = 4,230 − 47 = 4,183.
→ Machines win, decisively. Multi-digit arithmetic under time pressure is where biological cognition is at its worst and silicon is at its best — this is why nobody contests calculators.
The human bar is illustrative — peer-reviewed exact-accuracy figures vary by problem size and method. The point is the direction: humans are slow and error-prone here; a calculator is not.
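The round-and-correct trick in the verdict generalises. A short sketch (illustrative only; the function name and framing are this lab's, not a published method) of the human strategy next to what silicon does natively:

```python
# Illustrative: the "round up, then correct" mental-math decomposition
# from the verdict, versus direct machine multiplication.

def round_and_correct(a: int, b: int) -> int:
    """Compute a * b by rounding b up to the next multiple of 10."""
    b_rounded = (b // 10 + 1) * 10        # 89 -> 90
    overshoot = b_rounded - b             # 90 - 89 = 1
    return a * b_rounded - a * overshoot  # 47*90 - 47*1 = 4183

assert round_and_correct(47, 89) == 47 * 89 == 4183
```

The human runs this decomposition serially, in seconds, with working-memory errors possible at each step; the machine evaluates `47 * 89` in a single instruction.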
Verdict
Correct answer: B. The ball keeps moving forward (Newton's first law). Bolting the table down constrains the table, not the ball resting on it.
→ Humans typically still ahead on novel-twist physics. Frontier LLMs score 80–90% on standard physical-reasoning benchmarks (PIQA-style) where the situations resemble training data, but accuracy drops noticeably on small twists the training distribution did not cover — the failure mode the article describes.
Source pattern: PIQA + recent multimodal physical-reasoning benchmarks; exact numbers depend on prompt phrasing.
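The physics itself takes one line to state in code. A toy sketch (illustrative, assuming the scenario the verdict describes): with zero net horizontal force, the ball's velocity cannot change, whatever is done to the table.

```python
# Toy sketch of Newton's first law, illustrative only: with zero net
# horizontal force on the ball, its velocity never changes, whether
# or not the table beneath it is bolted down.

def simulate(v: float, steps: int, dt: float = 0.1) -> list[float]:
    """Euler-integrate the ball's position under zero net force."""
    x, positions = 0.0, []
    for _ in range(steps):
        x += v * dt  # acceleration is zero, so velocity stays v
        positions.append(round(x, 3))
    return positions

print(simulate(v=2.0, steps=5))  # [0.2, 0.4, 0.6, 0.8, 1.0]
```

The twist that trips models is not the law but its application: the bolted table is a detail that changes nothing, and recognising an irrelevant detail as irrelevant is exactly the off-distribution step the verdict describes.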
B. Tomatoes, in their freshest form, offer a depth of flavor that supermarket varieties often lack. Hand-picked from a home garden, they reveal a complexity that combines sweetness, acidity, and an almost herbal undertone. Many find this difference striking, especially when comparing them with the more uniform produce found in commercial settings.
Verdict
Correct answer: B. The tell-tale signs of frontier LLMs in 2024–2026: "in their freshest form," "offer a depth of flavor," "an almost herbal undertone," "more uniform produce found in commercial settings." Polite, balanced, generic, with an even sentence rhythm. Paragraph A has the texture of personal recall: a specific grandmother, an August date, a specific embarrassment.
→ Effectively tied, with both barely above chance. Humans are not reliable judges of AI-generated text in 2026, and automated detectors are not much better. This is one of the axes where the blurring is most consequential.
Independent reviews (Bischoff 2023, OpenAI internal 2023) put human discrimination of GPT-4 text at 50–60%.
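To make the tell-phrase intuition concrete, here is a deliberately naive sketch (the phrase list is invented for this example and is not from any published detector) that flags the generic phrasing the verdict points to, and in doing so shows why such heuristics fail:

```python
# Naive, illustrative detector: counts generic "LLM-ish" phrases.
# The phrase list is invented for this example; real detectors are
# statistical, and per the verdict still hover near chance.

TELL_PHRASES = [
    "in their freshest form",
    "offer a depth of flavor",
    "an almost herbal undertone",
    "many find this difference striking",
]

def llm_ish_score(text: str) -> float:
    """Fraction of tell-phrases present: a crude, brittle signal."""
    lowered = text.lower()
    hits = sum(phrase in lowered for phrase in TELL_PHRASES)
    return hits / len(TELL_PHRASES)

paragraph_b = ("Tomatoes, in their freshest form, offer a depth of "
               "flavor that supermarket varieties often lack.")
print(llm_ish_score(paragraph_b))  # 0.5: two of four phrases match
```

One paraphrase pass removes every hit, which is roughly why both human judges and automated detectors stay in the 50–60% band.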
Your scoreboard
After you reveal all four challenges, this fills in: your score on the left, the AI result on the right, each row labelled with its axis.
Lab #4. Companion to Human vs machine intelligence: where they actually differ. Feedback welcome.