Trust, but reproduce.

The verification layer for frontier machine learning.

A practitioner-only publication for working ML and agent engineers. We run the paper, score the reproduction, and ship the minimal code — so you know what holds up before you load it into your work.

Essays

The Harness Is the Product: Why Agent Evals Are the Real Moat

Swapping the frontier model rarely moves your agent's success rate as much as fixing retries and context management — and the one thing competitors can't clone is your evaluation environment. A thesis on why agent evals, not weights, are where reproducible capability accrues.

CHECKPOINT 0013 · 2026-06-27

Signals — the pulse

All signals →

Reproduction Watch

The tracker →

Explainers — teardowns

All explainers →

Recreations — from scratch

All recreations →

Libraries — the toolshed

All libraries →

Essays — the horizon

All essays →