Building, Measuring, and Using AI Scientists

Andrew White






Edison Scientific
GTC [S81694]
March 2026

Edison Scientific

  • Spinout from FutureHouse formed in 2025
  • AIxBio Research Lab
  • Based in San Francisco
  • 45 employees

Science is changing independent of AI


arXiv.org; 10.6084/m9.figshare.17064419.v3

Number of Researchers Is Growing

International R&D spending

PhD Researchers


NSF - https://ncses.nsf.gov/pubs/nsf24332; UNESCO UIS, SDG 9

Intellectual bottlenecks are growing


📝 Increasing paper count (≈10M per year)

🧬 Larger data sets from cheaper experiments (genome at $200 per person, $1 / GB of sequencing)

🔍 95% decline in disruptive papers since 1980

Park, M. et al. Nature 613, 138-144 (2023); Scannell, J.W. et al. Nat. Rev. Drug Discov. 11, 191–200 (2012); Deloitte 2025: Pharma innovation returns.

Edison Scientific Mission


Building an AI Scientist

What is an AI Scientist?


An AI Scientist is a system whose input is a general direction of discovery and whose output is experimental results, analysis, and a paper describing a novel discovery.

  • LLM: model (knowledge, text generation). Examples: GPT-5.2 [1], GLM 4.5 [2]
  • Agent: model + tools + task (can take actions). Examples: Deep Research [3], ChemCrow [4]
  • Co-Scientist: agent + conversation + long-running (human-in-the-loop). Examples: Google Co-Scientist [5], Biomni [6]
  • AI Scientist: autonomous + long-running + can execute experiments (novel discoveries). Example: Kosmos [7]

[1] OpenAI, 2025. [2] Zhipu AI, 2025. [3] OpenAI, 2025; arXiv:2312.07559. [4] Bran et al., Nat. Mach. Intell., 2024. [5] Gottweis & Natarajan, Google Research, 2025. [6] biomni.stanford.edu. [7] arXiv:2511.02824, 2025.
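
The "model + tools + task" composition of an agent can be illustrated with a toy loop. This is a minimal sketch, not how any of the systems named above are implemented: `run_agent`, `toy_policy`, and the `word_count` tool are hypothetical names, and the policy is a hard-coded stand-in for an LLM.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
Policy = Callable[[str, List[str]], Tuple[str, str]]


def run_agent(task: str, policy: Policy, tools: Dict[str, Tool],
              max_steps: int = 5) -> str:
    """Alternate between choosing an action and observing the tool output."""
    observations: List[str] = []
    for _ in range(max_steps):
        action, argument = policy(task, observations)
        if action == "finish":
            return argument
        # Taking an action means calling a tool and recording what it returns.
        observations.append(tools[action](argument))
    return "no answer within step budget"


def toy_policy(task: str, observations: List[str]) -> Tuple[str, str]:
    # Hypothetical stand-in for a model: call one tool, then report its output.
    if not observations:
        return ("word_count", task)
    return ("finish", observations[-1])


tools: Dict[str, Tool] = {"word_count": lambda text: str(len(text.split()))}
answer = run_agent("count the words here", toy_policy, tools)
```

In this framing, a co-scientist would add a human turn inside the loop, and an AI scientist would run the loop for much longer with tools that execute experiments.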

Benchmarking an AI Scientist

ARC (2018)

Devil facial tumor disease (DFTD) is a disease that is decimating the population of Tasmanian devils. The disease passes from one animal to another through bites and is caused by parasites. The parasites cause cancerous tumors that spread throughout an infected animal's body and kill it. What is the best description of DFTD?

  1. a non-infectious, cell-cycle disease
  2. an infectious, cell-cycle disease
  3. a non-infectious, chronic disease
  4. an infectious, chronic disease

MMLU (2020)

A frameshift mutation is created when

  1. telomeric sequences are removed from DNA
  2. a codon's nucleotide sequence changes so that it calls for production of a different amino acid than the original one
  3. a base pair is either inserted or deleted in a gene
  4. a codon's nucleotide sequence is changed so that instead of coding for a given amino acid it acts to terminate translation

LAB-Bench (2024)

Approximately what percentage of Drosophila with a H3.3K36R mutation finish developing and eclose?

  1. 80%
  2. 19%
  3. 50%
  4. 37%
  5. 6%
  6. 94%

HLE (2025)

In a bioinformatics lab, Watterson's estimator (θ) and π (nucleotide diversity) will be calculated from variant call files. Will these calculations be biased if we are aiming to measure diversity of a whole population?

  1. Only Watterson's estimator (θ) is biased.
  2. Only π (nucleotide diversity) is biased.
  3. Both are biased.
  4. Neither is biased.
  5. None of the other answers are correct.
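
For readers unfamiliar with the two statistics named in this question, the sketch below computes both from a small set of aligned haplotypes. It uses the standard definitions (Watterson's θ is the number of segregating sites divided by the harmonic number a_n; π is the mean number of pairwise differences) and does not attempt to answer the benchmark question. The function names and the four toy sequences are made up for illustration.

```python
from itertools import combinations
from typing import List


def wattersons_theta(haplotypes: List[str]) -> float:
    """Segregating sites S divided by a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    segregating = sum(
        1 for j in range(num_sites)
        if len({h[j] for h in haplotypes}) > 1
    )
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n


def nucleotide_diversity(haplotypes: List[str]) -> float:
    """Mean number of differing sites over all pairs of sequences."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


# Illustrative toy alignment, not real data.
toy = ["AACG", "AACT", "GACT", "GACG"]
theta_w = wattersons_theta(toy)      # 2 segregating sites / (1 + 1/2 + 1/3)
pi = nucleotide_diversity(toy)       # 8 pairwise differences / 6 pairs
```

Both values here are per sequence rather than per site; dividing by the alignment length would give per-site estimates.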

LABBench2 (2026)

Deletion of which residues from C. elegans protein COSA-1 would most likely affect the ability of COSA-1 to recruit MSH5 and ZHP3?

Residues 31–40

Snapshot of Current Models

Training Agents

How are agents trained?

  1. Pre-training — broad knowledge via next-token prediction
  2. Supervised Fine-Tuning (SFT) — learn from expert instruction–response pairs
  3. Reinforcement Learning (RL) — optimize through environment interaction with verifiable rewards

RL expands capabilities beyond what supervised data can teach

Verifiable Rewards

Computational checks (code execution, tool outputs) produce objective reward signals — no human feedback needed
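
A verifiable reward can be as simple as a function that checks the agent's final answer against a known value. The sketch below is one assumed form for numeric answers; `verifiable_reward` and its `rel_tol` parameter are hypothetical and are not taken from BixBench or Aviary.

```python
import math


def verifiable_reward(submitted: str, expected: float,
                      rel_tol: float = 1e-6) -> float:
    """Return 1.0 if the submitted text parses to the expected number, else 0.0."""
    try:
        value = float(submitted.strip())
    except ValueError:
        # Unparseable answers earn no reward.
        return 0.0
    return 1.0 if math.isclose(value, expected, rel_tol=rel_tol) else 0.0
```

Because the check is purely computational, the same function can score thousands of rollouts without a human in the loop.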

BixBench: verifiable bioinformatics tasks as RL training signal

Swanson et al., BixBench, 2025; Narayanan et al., Aviary, 2024; NVIDIA Technical Blog

BixBench Example

How many peripheral immune cell types show significant differential expression (adjusted p-value < 0.05) of SOCS3?
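
A question like this reduces to a count that code can verify. The sketch below shows one plausible final step, assuming per-cell-type raw p-values for the gene are already available and that "adjusted" means Benjamini-Hochberg correction (the question does not say which method is used). The cell-type names and p-values are placeholders, not data from the benchmark.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Placeholder p-values for one gene across hypothetical cell types.
raw_p = {
    "cell_type_A": 0.001,
    "cell_type_B": 0.012,
    "cell_type_C": 0.020,
    "cell_type_D": 0.200,
    "cell_type_E": 0.650,
}
names = list(raw_p)
adj = benjamini_hochberg([raw_p[name] for name in names])
significant = [name for name, q in zip(names, adj) if q < 0.05]
n_significant = len(significant)
```

The integer `n_significant` is the kind of output that can be compared directly against a reference answer to produce a reward.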

Learning vs Frontier Models

Training Curve

Making Good Hypotheses

Shallow


Deep


Wide: mechanism to target to drug

ROBIN: A Multi-Agent System for Automating Scientific Discovery

Ali Essam Ghareeb*, Benjamin Chang*, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White†, Michaela M. Hinks‡, Samuel G. Rodriques

Kosmos: An AI Scientist for Autonomous Discovery


Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C Landsness, Daniel L Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P Shriver, Fang Cao, Asmamaw T Wassie, Jon M Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F Roberts, Sladjana Zagorac, Timothy C Orr, Miranda E Orr, Kevin J Zwezdaryk, Ali E Ghareeb, Laurie McCoy, Bruna Gomes, Euan A Ashley, Karen E Duff, Tonio Buonassisi, Tom Rainforth, Randall J Bateman, Michael Skarlinski, Samuel G Rodriques, Michaela M Hinks, Andrew D White
arXiv:2511.02824, 2025

How do you validate systems like this?

Work with external groups and use their experimental data as input. Three of Kosmos's discoveries reproduced findings from unpublished work; four were novel.


Kosmos overview

Independent expert annotation of task difficulty and correctness

Example Discovery: What Kosmos found

Human validation

Kosmos Scale

  • 120 sandboxed environments with 32 GB RAM / 8 CPUs
  • 3,000 papers parsed and considered
  • 24–48 hours of run time
  • Generates up to 4 TB of data

Challenges Ahead

  • Journals adapting to influx of AI-generated articles
  • How can we train end-to-end on sparse data?
  • What are the economics of these systems?
Questions