Where Humans and Machines Diverge — Lab
Where humans and machines diverge
Four short challenges. You try each one, then see how today's frontier AI performs on the same kind of task, and on which axes the gap is largest. The pattern across all four is the real subject of the companion post.
Before you start
Read this short panel first. It tells you what the lab is, what it is trying to make you see, and how you will know if you got there.
🎯 Purpose
Four small cognitive tasks. You attempt each one and pick an answer. For each task, the lab then reveals — with published research figures — how current frontier AI systems perform on the same kind of task. The tasks are chosen to span the five axes of difference from the companion article: data efficiency, embodiment, transfer, common sense, and energy (the fifth is described after the challenges, since it cannot be played).
💡 What it is trying to make you see
That "is AI smarter than humans?" is the wrong question. The right question is "on which axis?" Some tasks make humans look like geniuses next to machines; other tasks make machines look like geniuses next to humans. Both verdicts are honest — they are answering different questions. The companion article explains this in words; the lab makes you experience the asymmetry on yourself.
✅ What you should understand after playing
After working through the four challenges, you should leave able to:
- Name at least two tasks on which today's AI is reliably better than the average human, and two on which the average human is reliably better than today's AI.
- Explain, for one of those tasks, why the gap goes that way — what about the underlying mechanism makes one side win.
- Catch yourself wanting to ask "better on which task?" whenever someone says "AI is/isn't smarter than humans."
If those three are true for you when you leave, the lab did its job. If not, re-read the verdict notes inside each challenge and revisit the article's five axes.
How to use it — 30 seconds
- Read each challenge and click the answer you think is correct.
- Read the verdict. The lab reveals the correct answer, whether you got it right, and how current AI does on the same task, along with the source and the approximate published figure.
- Watch the scoreboard at the bottom. After all four, you will see your performance and AI's side-by-side across the four axes.
A note on the AI numbers
The "AI performance" numbers in each verdict are based on published benchmark results from 2023–2025 for the relevant frontier model class. They are best-effort summaries, not exact reproductions of any single run, and they are directional: the point is which side wins each task, not the precise percentage. Sources are cited in the verdict notes.
The four challenges
Pick the best answer for each. You can change your mind before clicking Reveal. After revealing, the next challenge unlocks.
⬛⬜⬛ → ⬜⬛⬜
⬛⬛⬜ → ⬜⬜⬛
⬜⬛⬛ → ⬛⬜⬜
⬛⬜⬜ → ?
Verdict
Correct answer: A (⬜⬛⬛). The rule inverts each cell. Most humans see this rule after one or two examples; on the broader ARC-AGI-1 benchmark, the published human baseline sits around 85%.
→ Humans still ahead on the harder benchmark. Frontier AI closed most of the ARC-AGI-1 gap with very high compute, but ARC-AGI-2 re-opens the gap by design.
Source: ARC Prize 2024 / 2025 (arcprize.org/blog/oai-o3-pub-breakthrough · arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025).
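To see how little machinery the rule needs once it is stated, here is a minimal Python sketch (illustrative only, not the lab's actual code) that applies the per-cell inversion to the rows above:

```python
# Illustrative sketch: the puzzle's rule is a per-cell inversion.
# Not the lab's scoring code; just the rule made explicit.

INVERT = {"⬛": "⬜", "⬜": "⬛"}

def apply_rule(row: str) -> str:
    """Invert every cell in a row of ⬛/⬜ symbols."""
    return "".join(INVERT[cell] for cell in row)

for row in ["⬛⬜⬛", "⬛⬛⬜", "⬜⬛⬛"]:
    print(row, "->", apply_rule(row))

print("⬛⬜⬜", "->", apply_rule("⬛⬜⬜"))  # ⬜⬛⬛, answer A
```

Three examples are enough for most humans to induce this mapping; the verdict's point is how much data or compute a machine needs to find an unseen rule of this kind.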
47 × 89 = ?
Verdict
Correct answer: B (4,183). 47 × 89 = (47 × 90) − 47 = 4,230 − 47 = 4,183.
→ Machines win, decisively. Multi-digit arithmetic under time pressure is where biological cognition is at its worst and silicon is at its best — this is why nobody contests calculators.
The human bar is illustrative — peer-reviewed exact-accuracy figures vary by problem size and method. The point is the direction: humans are slow and error-prone here; a calculator is not.
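The round-and-correct trick in the verdict generalises. A short sketch (illustrative only; the function name and framing are this lab's, not a published method) of the human strategy next to what silicon does natively:

```python
# Illustrative: the "round up, then correct" mental-math decomposition
# from the verdict, versus direct machine multiplication.

def round_and_correct(a: int, b: int) -> int:
    """Compute a * b by rounding b up to the next multiple of 10."""
    b_rounded = (b // 10 + 1) * 10        # 89 -> 90
    overshoot = b_rounded - b             # 90 - 89 = 1
    return a * b_rounded - a * overshoot  # 47*90 - 47*1 = 4183

assert round_and_correct(47, 89) == 47 * 89 == 4183
```

The human runs this decomposition serially, in seconds, with working-memory errors possible at each step; the machine evaluates `47 * 89` in a single instruction.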
Verdict
Correct answer: B. The ball keeps moving forward (Newton's first law). Bolting the table down constrains the table, not the ball resting on it.
→ Humans typically still ahead on novel-twist physics. Frontier LLMs score 80–90% on standard physical-reasoning benchmarks (PIQA-style) where the situations resemble training data, but accuracy drops noticeably on small twists the training distribution did not cover — the failure mode the article describes.
Source pattern: PIQA + recent multimodal physical-reasoning benchmarks; exact numbers depend on prompt phrasing.
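The physics itself takes one line to state in code. A toy sketch (illustrative, assuming the scenario the verdict describes): with zero net horizontal force, the ball's velocity cannot change, whatever is done to the table.

```python
# Toy sketch of Newton's first law, illustrative only: with zero net
# horizontal force on the ball, its velocity never changes, whether
# or not the table beneath it is bolted down.

def simulate(v: float, steps: int, dt: float = 0.1) -> list[float]:
    """Euler-integrate the ball's position under zero net force."""
    x, positions = 0.0, []
    for _ in range(steps):
        x += v * dt  # acceleration is zero, so velocity stays v
        positions.append(round(x, 3))
    return positions

print(simulate(v=2.0, steps=5))  # [0.2, 0.4, 0.6, 0.8, 1.0]
```

The twist that trips models is not the law but its application: the bolted table is a detail that changes nothing, and recognising an irrelevant detail as irrelevant is exactly the off-distribution step the verdict describes.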
B. Tomatoes, in their freshest form, offer a depth of flavor that supermarket varieties often lack. Hand-picked from a home garden, they reveal a complexity that combines sweetness, acidity, and an almost herbal undertone. Many find this difference striking, especially when comparing them with the more uniform produce found in commercial settings.
Verdict
Correct answer: B. The tell-tale signs of frontier LLMs in 2024–2026: "in their freshest form," "offer a depth of flavor," "an almost herbal undertone," "more uniform produce found in commercial settings." Polite, balanced, generic, with an even sentence rhythm. Paragraph A has the texture of personal recall: a specific grandmother, an August date, a specific embarrassment.
→ Effectively tied, with both barely above chance. Humans are not reliable judges of AI-generated text in 2026, and automated detectors are not much better. This is one of the axes where the blurring is most consequential.
Independent reviews (Bischoff 2023, OpenAI internal 2023) put human discrimination of GPT-4 text at 50–60%.
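To make the tell-phrase intuition concrete, here is a deliberately naive sketch (the phrase list is invented for this example and is not from any published detector) that flags the generic phrasing the verdict points to, and in doing so shows why such heuristics fail:

```python
# Naive, illustrative detector: counts generic "LLM-ish" phrases.
# The phrase list is invented for this example; real detectors are
# statistical, and per the verdict still hover near chance.

TELL_PHRASES = [
    "in their freshest form",
    "offer a depth of flavor",
    "an almost herbal undertone",
    "many find this difference striking",
]

def llm_ish_score(text: str) -> float:
    """Fraction of tell-phrases present: a crude, brittle signal."""
    lowered = text.lower()
    hits = sum(phrase in lowered for phrase in TELL_PHRASES)
    return hits / len(TELL_PHRASES)

paragraph_b = ("Tomatoes, in their freshest form, offer a depth of "
               "flavor that supermarket varieties often lack.")
print(llm_ish_score(paragraph_b))  # 0.5: two of four phrases match
```

One paraphrase pass removes every hit, which is roughly why both human judges and automated detectors stay in the 50–60% band.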
Your scoreboard
After you reveal all four challenges, this fills in: your score on the left, the AI result on the right, each row labelled with its axis.
Lab #4. Companion to Human vs machine intelligence: where they actually differ. Feedback welcome.