Part IV: From LLM Wiki to the AI Scientist

Chapter 12: AI Scientist in Practice — AI/Robotics, Bio/Chem/Drug, Autoresearch

Written: 2026-04-28 Last updated: 2026-04-28

12.1 Two Existence Proofs, One Week Apart

In April 2026, two papers appeared within a week of each other.

Anthropic's AAR (Autonomous Alignment Research) [Anthropic, 2026]: nine Claude Opus 4.6 instances ran alignment research autonomously for 5 days, working independently, integrating results, and generating new hypotheses. Result: PGR (Progress Rate) 0.97 vs. a human baseline of 0.23, at a cost of approximately $22/hour.

Karpathy's autoresearch [Karpathy, 2026]: 700 experiments optimizing GPT-2 training in 2 days. Validation loss 0.862415 → 0.858039 (-0.5%). An exploration that would take humans months, completed by agents in two days.

In one week, autonomous agents produced production-grade results in both alignment research (Anthropic) and ML research (Karpathy). This is the evidence that the AI Scientist is not "future talk" but "happening now."

12.2 Three Stages of Research Democratization

The post [Um, 2026] classifies AI-driven research into three stages:

S1: Document-bound Research

Already AI > human: literature search, paper summarization, comparative analysis, review writing. Doable right now with Claude Code + LLM Wiki.

S2: In-silico Experiments

Experiments that complete inside a computer. ML training, simulation, data analysis. Karpathy's autoresearch and Anthropic's AAR are production examples of this stage.

S3: Physical Experiments

Requires real labs, real chemicals, real robots. Still early, but the direction is clear.

Figure 12.1: Three stages of research democratization — S1 document-bound (AI greater than human now), S2 in-silico (autoresearch and AAR are production examples), S3 physical experiments (self-driving labs emerging). Illustration by the author, Gemini-assisted.

12.3 ML / Alignment Research — Autonomous Agents

Karpathy's Autoresearch

The design of [Karpathy, 2026]:

  1. Define the search space: GPT-2 training hyperparameter combinations (learning rate, batch size, architecture variants)
  2. Agent loop: propose → run → evaluate → propose next
  3. Record results: store all experiment configs and outcomes in the filesystem
  4. Meta-harness pattern: Chapter 9's meta-harness is exactly this loop
Figure 12.2: Karpathy autoresearch loop — define search space, propose experiment, run and evaluate, record to filesystem. 700 experiments in 2 days; validation loss 0.862415 to 0.858039. Illustration by the author, Gemini-assisted.
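The four-step loop above can be sketched in a few lines. Everything here is a stand-in: `run_experiment` replaces real GPT-2 training with a toy objective, and the file names and config fields are hypothetical, not Karpathy's actual harness.

```python
import json
import random
from pathlib import Path

def propose(history):
    """Step 1: propose the next config, biased toward the best run so far."""
    if not history:
        return {"lr": 10 ** random.uniform(-4, -2)}
    best = min(history, key=lambda r: r["loss"])
    return {"lr": best["config"]["lr"] * random.uniform(0.5, 2.0)}

def run_experiment(config):
    """Step 2 stand-in for training: loss is minimized at lr = 3e-3."""
    return abs(config["lr"] - 3e-3)

def autoresearch(log_dir, n_experiments=50):
    """Steps 3-4: evaluate each run and record it to the filesystem."""
    log_dir = Path(log_dir)
    history = []
    for i in range(n_experiments):
        config = propose(history)
        loss = run_experiment(config)
        record = {"id": i, "config": config, "loss": loss}
        history.append(record)
        # The filesystem is the experience store: every config and outcome
        # survives the run and feeds the next proposal.
        (log_dir / f"exp_{i:04d}.json").write_text(json.dumps(record))
    return min(history, key=lambda r: r["loss"])
```

The point of the sketch is step 4: because every record lands in plain files, the loop is restartable and inspectable, which is exactly the meta-harness property Chapter 9 describes.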

[Karpathy, 2026] trained a small ChatGPT-style model (nanochat) the same way in one day, demonstrating that the pattern scales.

Anthropic's AAR

Key finding from [Anthropic, 2026]: nine agents working independently were more effective than a single long-running agent, because "diversity prevented premature convergence." A PGR of 0.97 is more than 4x the 0.23 achieved by human experts on the same task.
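The independent-agents-then-integrate shape can be sketched as follows. This is a toy: `research_run` stands in for a full research agent with a seeded random search, and the integration step is simply "keep the best result"; AAR's actual integration is far richer.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def research_run(seed):
    """Stand-in for one independent research agent: a seeded random search."""
    rng = random.Random(seed)
    best = min(abs(rng.random() - 0.5) for _ in range(100))
    return {"seed": seed, "score": best}

def integrate(n_agents=9):
    """Run agents independently, then integrate by keeping the best result.

    Each agent explores from its own seed, so the ensemble covers the space
    more diversely than one agent with a 9x longer budget starting from a
    single trajectory would.
    """
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        results = list(pool.map(research_run, range(n_agents)))
    return min(results, key=lambda r: r["score"])
```

The design choice being illustrated: diversity comes from independent starting points, not from communication during the run; integration happens only at the end.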

Common implication of both cases: agent loop + filesystem-based experience accumulation + meta-harness pattern is the core structure of research automation.

12.4 Medical AI — Clinical Task Benchmarks

Med-AI Bench, from Wu et al. (arXiv 2603.28589) [Um, 2026]:

  • 19 clinical tasks
  • 171 cases
  • Scope: diagnosis, treatment planning, medical report generation, clinical reasoning

Results: AI exceeded human baselines on multiple clinical tasks. The gap was largest in documentation-heavy tasks (report generation, coding).

Note: This benchmark demonstrates AI's potential in clinical settings, but actual clinical deployment requires separate regulatory, ethical, and validation processes. AI Scientist can support clinical research; that is a different question from replacing clinical judgment.

12.5 AI Co-Scientist — Google's Multi-Agent Research System

Google's AI Co-scientist (arXiv 2502.18864) [Um, 2026]:

  • Architecture: Multi-agent system on Gemini 2.0
  • Loop: Generate → Debate → Evolve + Elo tournament
  • Validated domains: AML (Acute Myeloid Leukemia) drug repurposing, liver fibrosis

Key innovation: agents debate and critique each other's hypotheses. An Elo tournament automatically selects the strongest hypotheses. New hypotheses are generated and validated without human expert feedback.
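The Elo-tournament selection step can be sketched generically. The `judge` callable is a stand-in for the LLM debate/critique that decides each pairwise comparison in the real system; the rating formula itself is the standard Elo update.

```python
import itertools

def elo_update(r_a, r_b, a_wins, k=32):
    """Standard Elo update after one pairwise comparison."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

def tournament(hypotheses, judge, rounds=3):
    """Rank hypotheses by repeated pairwise debates scored by `judge`.

    `judge(a, b)` returns True if hypothesis a wins the debate; in the real
    system this role is played by agent critique, not a fixed function.
    """
    ratings = {h: 1000.0 for h in hypotheses}
    for _ in range(rounds):
        for a, b in itertools.combinations(hypotheses, 2):
            a_wins = judge(a, b)
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins)
    return sorted(hypotheses, key=ratings.get, reverse=True)
```

Elo's appeal here is that it needs only pairwise judgments, which is exactly what a debate between two agents produces; no absolute quality score is ever required.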

This is Chapter 9's dependency graph + multi-agent verification + meta-harness applied to the research domain.

Figure 12.3: AI Co-Scientist loop — Generate, Debate, Evolve. Agents propose hypotheses, critique each other, and an Elo tournament automatically selects the strongest. Illustration by the author, Gemini-assisted.

12.6 Self-Driving Labs — Chemistry / Biology

From [Um, 2026] (summarized from Rachel Brazil's Nature feature):

  • Labs that autonomously perform chemical synthesis and screening
  • Humans design experiments; robots execute; AI analyzes results and proposes next experiments
  • Current applications: materials discovery, drug candidate exploration
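The closed loop described above (humans design, robots execute, AI analyzes and proposes) can be sketched as a minimal simulation. Both functions are hypothetical stand-ins: `robot_execute` fakes a physical experiment with a known optimum, and `propose_next` is a naive hill-climbing analysis step.

```python
def robot_execute(recipe):
    """Stub for the physical experiment: yield peaks at temperature 70."""
    return 1.0 - abs(recipe["temp"] - 70) / 100

def propose_next(results):
    """Naive AI analysis step: move toward the best-yielding recipe so far."""
    best = max(results, key=lambda r: r["yield"])
    step = 5 if best["recipe"]["temp"] < 70 else -5
    return {"temp": best["recipe"]["temp"] + step}

def closed_loop(initial_recipes, iterations=10):
    """Humans supply initial_recipes; the robot/AI loop takes over from there."""
    results = [{"recipe": r, "yield": robot_execute(r)} for r in initial_recipes]
    for _ in range(iterations):
        recipe = propose_next(results)      # AI proposes the next experiment
        results.append({"recipe": recipe, "yield": robot_execute(recipe)})
    return max(results, key=lambda r: r["yield"])
```

In a real self-driving lab the `robot_execute` call is hours of synthesis and assay, which is why the quality of `propose_next` (the AI side of the loop) dominates throughput.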

This is the S3 stage of research democratization. S1 (document) and S2 (in-silico) are possible right now; S3 requires additional physical infrastructure and safety validation.

12.7 Scope Honesty — The Robotics Gap

This chapter's title includes "AI/Robotics," but the current corpus doesn't have sufficient AI Scientist examples specific to robotics. To be clear:

  • Covered: ML/alignment research (autoresearch, AAR), medical AI (Med-AI Bench), chemistry/biology (self-driving labs), AI Co-scientist (Google)
  • Not covered: Robotics-specific AI Scientist — this domain is in preparation (Part V: Robotics is forthcoming)

The AI Scientist in robotics involves a far more complex physical experimentation loop. Simulation enables S2, but real-hardware experiments require separate safety validation frameworks. That is as far as this book can honestly go.

12.8 Closing the Loop

Chapter 1 started with "knowledge externalization." The real cost of Claude→Codex migration isn't code changes — it's extracting knowledge locked inside model conversations into plain-text files like AGENTS.md, HANDOFF.md, TASKS.md.

Chapter 10's LLM Wiki extended that externalization to research knowledge. Chapter 11 turned daily activity into AI external memory. And this chapter shows the point where that external memory becomes the research loop itself.

Karpathy's trajectory shows this most clearly:

  • LLM Wiki (Chapter 10): "Put everything in markdown"
  • Autoresearch (Chapter 12): "Agents run experiments"
  • Same author, same scaffolding, one year apart

From external memory to autonomous research — that is the final destination of harness engineering.


References

  1. Anthropic, "Autonomous Alignment Research (AAR)," 2026-04-14. [Anthropic, 2026]
  2. Karpathy, Andrej, "Autoresearch," 2026. [Karpathy, 2026]
  3. Karpathy, Andrej, "Autoresearch — Round 1 tweet," 2026. [Karpathy, 2026]
  4. Karpathy, Andrej, "NanoChat," 2026. [Karpathy, 2026]
  5. terryum, "AAR post," terryum-ai, 2026. [Um, 2026]
  6. terryum, "Autoresearch post," terryum-ai, 2026. [Um, 2026]
  7. terryum, "AI Co-scientist post," terryum-ai, 2026. [Um, 2026]
  8. Wu et al., "Med-AI Bench," arXiv 2603.28589, 2026. [Um, 2026]
  9. terryum, "Self-driving labs post," terryum-ai, 2026. [Um, 2026]
  10. terryum, "Research democratization post," terryum-ai, 2026. [Um, 2026]
  11. Google AI, "AI Co-scientist," arXiv 2502.18864, 2026. (primary source for [Um, 2026])