Generalizing the agentic experiment loop
Inspired by Karpathy's autoresearch and Shopify Liquid PR #2056
Two projects independently discovered the same loop:

- Karpathy's autoresearch: the agent modifies train.py, trains for 5 min, checks val_bpb, keeps or discards. ~100 experiments overnight.
- Shopify Liquid PR #2056: the agent modifies Ruby source, runs tests + benchmarks, keeps or discards. ~120 experiments. Result: 53% faster.
The human writes the process. The agent writes the code.
The improve file is part config, part prompt: it describes what to change, how to measure, and when to stop.
```
# autoimprove: make-it-faster

## Change
scope: the checkout handler and its database queries
exclude: test/, vendor/

## Check
test: go test ./...
run: go test -bench=. -benchmem
score: ns/op:\s+([\d.]+)
goal: lower
guard: allocs/op: ([\d.]+) < 500
keep_if_equal: true
timeout: 3m

## Stop
budget: 4h
stale: 15

## Instructions
Reduce allocations in hot paths. Try buffer reuse, fast-path patterns.
```
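The loop this config drives can be sketched in a few lines. This is a hypothetical harness, not the skill's actual implementation; the function names are made up, but the semantics mirror the `score`, `goal`, and `keep_if_equal` fields above:

```python
import re

def extract_score(output: str, pattern: str):
    """Pull the score out of the run command's stdout (first capture group)."""
    m = re.search(pattern, output)
    return float(m.group(1)) if m else None

def keep(score: float, best: float, goal: str, keep_if_equal: bool = False) -> bool:
    """Decide whether an experiment's change is kept or discarded."""
    if score == best:
        return keep_if_equal
    return score < best if goal == "lower" else score > best

# Illustrative benchmark output shaped to match the config's regex:
out = "ns/op:  1432.5   allocs/op: 312"
extract_score(out, r"ns/op:\s+([\d.]+)")  # 1432.5
```

A kept experiment becomes the new baseline; a discarded one is reverted before the next attempt.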
You describe what to optimize. The agent resolves it to specific files.
```
## Change
scope: the template parsing engine
exclude: test/, benchmark/
```

```
Resolved scope "the template parsing engine" to:
- lib/liquid/parser.rb
- lib/liquid/lexer.rb
- lib/liquid/variable.rb

These are the ONLY files that will be modified. Confirm? [y/n]
```
- `exclude` prevents the agent from grading its own homework.
- `test` is the correctness gate: it must pass for any experiment to be kept. Generated by goal-aware bootstrap.
- `score` is the metric to optimize, extracted from stdout via convention, regex, or jq.
- `guard` lists secondary metrics that must not regress, so the agent can't improve speed by breaking reliability.
```
test: go test ./...                  # gate
run: go test -bench=. -benchmem      # produces the score output
score: ns/op:\s+([\d.]+)             # score
guard: allocs/op: ([\d.]+) < 500     # guard
```
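A guard line bundles an extraction regex with a threshold. A minimal sketch of how a harness might evaluate one; the "everything before the comparator is the regex" convention is an assumption, not the skill's documented format:

```python
import re

def check_guard(guard: str, output: str) -> bool:
    """Evaluate a guard like 'allocs/op: ([\\d.]+) < 500' against run output."""
    m = re.match(r"(.+?)\s*(<=|>=|<|>)\s*([\d.]+)$", guard)
    if m is None:
        raise ValueError(f"unparseable guard: {guard}")
    pattern, op, limit = m.group(1), m.group(2), float(m.group(3))
    hit = re.search(pattern, output)
    if hit is None:
        return False  # missing metric counts as a regression
    value = float(hit.group(1))
    return {"<": value < limit, ">": value > limit,
            "<=": value <= limit, ">=": value >= limit}[op]
```

A failed guard discards the experiment even when the primary score improved.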
The optimization goal predicts what the agent will break.
| Goal | Agent will try to... | Tests guard against... |
|---|---|---|
| Faster | Skip work, remove checks | Edge cases, unicode, nil, concurrency |
| Smaller | Remove things, swap deps | Features still work, runtime deps present |
| More accurate | Overfit, leak data | Data leakage, reproducibility, valid outputs |
| Better RAG | Game retrieval, stuff context | Format consistency, hallucination, empty results |
| Lower cost | Downsize, cut redundancy | Load handling, failover, durability |
```
/autoimprove bootstrap --generate
```
One command. The agent detects what's missing and walks you through it.
| Type | What it optimizes | Typical metric |
|---|---|---|
| perf | Code performance | ns/op, req/sec, allocations |
| ml | ML training | val_bpb, loss |
| automl | Tabular ML | AUC-ROC, F1 |
| rag | RAG pipeline | answer relevancy, faithfulness |
| docker | Container size | image bytes |
| k8s | Cluster health | running pod count |
| prompt | Prompt quality | F1, accuracy |
| sql | Query performance | execution time |
| frontend | Bundle size | bundle bytes |
| ci | Build speed | build time |
```
/autoimprove init --type rag
```
```
# autoimprove: faster-checkout-api

## Change
scope: the checkout handler and its database queries
exclude: test/, vendor/

## Check
test: go test ./...
run: hey -n 1000 http://localhost:8080/checkout
score: Requests/sec:\s+([\d.]+)
goal: higher
guard: latency_p99: ([\d.]+) < 500
timeout: 3m

## Stop
budget: 4h
stale: 15

## Instructions
Try query batching, connection pooling, response caching.
Don't change the API contract or add dependencies.
```
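The Stop section maps to a simple check run between experiments. A sketch under assumed names; `stale` counts consecutive experiments with no kept improvement:

```python
import time

def should_stop(started_at: float, since_last_improvement: int,
                budget_seconds: int = 4 * 3600, stale: int = 15) -> bool:
    """Stop when the time budget is spent or too many experiments in a
    row failed to improve the score (the 'stale' counter)."""
    if time.monotonic() - started_at >= budget_seconds:
        return True
    return since_last_improvement >= stale
```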
```
# autoimprove: better-rag-answers

## Change
scope: the RAG pipeline — chunking, retrieval, generation
exclude: data/, eval/

## Check
test: python -m pytest tests/test_pipeline.py -x
run: python eval/run_eval.py
score: answer_relevancy: ([\d.]+)
goal: higher
guard: error_rate: ([\d.]+) < 0.1
keep_if_equal: true
timeout: 5m

## Stop
budget: 6h
target: 0.92

## Instructions
Try: chunk size tuning, hybrid search, cross-encoder reranking,
query expansion, chain-of-thought generation.
```
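The score regex implies a contract on the eval script's stdout. A minimal sketch of the reporting end of a hypothetical `eval/run_eval.py`; the metric computation itself is a placeholder:

```python
def report(relevancy_scores: list, errors: int, total: int) -> str:
    """Emit the lines the Check regexes above expect to find on stdout."""
    lines = [
        f"answer_relevancy: {sum(relevancy_scores) / len(relevancy_scores):.3f}",
        f"error_rate: {errors / total:.3f}",
    ]
    return "\n".join(lines)

print(report([0.91, 0.84, 0.88], errors=1, total=20))
```

As long as the script keeps printing these two lines, the agent is free to rewrite everything upstream of them.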
```
# autoimprove: better-churn-model

## Change
scope: the training pipeline
exclude: data/, evaluate.py

## Check
test: python -m pytest tests/ -x
run: python train.py && python evaluate.py
score: auc_roc: ([\d.]+)
goal: higher
guard: f1_score: ([\d.]+) > 0.6
timeout: 3m

## Stop
budget: 4h
target: 0.95

## Instructions
Try: ratio features, rolling aggregates, target encoding,
XGBoost vs LightGBM vs CatBoost, model stacking.
```
Goes beyond AutoML: the agent can engineer features, rewrite preprocessing, and try novel model combinations.
Applied to a RAG search engine: hybrid search over 44K chunks, 301 documents, 20-query golden set.
| # | Experiment | Result |
|---|---|---|
| 1 | Fix keyword query sanitization (special chars crashed) | kept +0.35% |
| 2 | Adjust hybrid weights (0.4/0.6) + fetch limit | discarded |
| 3 | Replace weighted merge with Reciprocal Rank Fusion | kept +3.6% |
| 4 | Boost results appearing in both retrieval lists | discarded (0%) |
| 5 | Limit max 2 results per source for diversity | kept (equal) |
| 6 | Lower RRF constant (k=30) + 5x fetch | discarded -2.1% |
Round 2 applied the improved protocol (per-experiment commits, guards, keep_if_equal, supersedes).
| # | Experiment | Result |
|---|---|---|
| 7 | Boost results matching query in source name | discarded -4.8% |
| 8 | OR-mode keyword search (broader recall) | kept +1.7% |
| 9 | Fetch limit 5x | discarded |
| 10 | Better dedup key (120 chars vs 50) | kept (equal) |
| 11 | BM25 column weights (content=10x, name=2x) | kept +0.6% |
| 12 | Max 1 result per source | discarded |
| 13 | Query-text overlap boost after RRF | discarded -3.4% |
| 14 | Fetch limit 4x (supersedes #9) | kept +2.7% |
OR-mode keyword search (#8). Changed "all words must match" to "any word matches." A chunk about "growth loops" now surfaces for "how to run growth experiments" even without the word "experiments." RRF handles the noise: results matching both keyword AND semantic score highest.

BM25 column weights (#11). Keyword matches in content (weight 10) now score 5x higher than matches in metadata (weight 2). Previously a keyword hit in the "chunk_id" column scored the same as a hit in the actual text.

Fetch limit 4x (#14). With better keyword search (OR-mode + BM25 weights), more candidates are relevant. 4x fetch gives RRF a larger pool without drowning in noise. Guest hit rate went from 35% to 40%.

Reciprocal Rank Fusion (#3). Replaced weighted score merge with rank-based fusion. Avoids the normalization problem of comparing BM25 scores with cosine similarity. Single biggest improvement (+3.6%).
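Reciprocal Rank Fusion is small enough to sketch directly. The standard formulation scores each document by its summed reciprocal rank across lists; k=60 is the conventional constant (experiment #6 tried k=30):

```python
def rrf(keyword_ranked: list, semantic_ranked: list, k: int = 60) -> list:
    """Fuse two ranked lists by summed reciprocal rank: 1 / (k + rank).
    A document near the top of both lists outranks one that tops only one."""
    scores = {}
    for ranking in (keyword_ranked, semantic_ranked):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

rrf(["a", "b", "c"], ["b", "c", "d"])  # "b" wins: it ranks high in both lists
```

Because only ranks are used, raw BM25 and cosine-similarity scores never need to be put on the same scale.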
Building the golden set and eval script took longer than running 14 experiments. Most codebases don't have a measurable score out of the box.
Goal-aware bootstrap caught a crash on hyphenated queries before optimization started. The agent never would have found it by tuning scores.
Round 1 skipped per-experiment commits and lost rollback ability. Round 2 committed every experiment, giving a clean git reset on every discard.
Fetch 5x failed in round 1 but 4x succeeded in round 2 because OR-mode keyword search changed the quality of candidates. Context matters.
The skill runs in Claude Code. The protocol runs anywhere.
```
/autoimprove                                 # Claude Code (interactive)
claude -p "run /autoimprove on improve.md"   # headless overnight

/autoimprove --export                        # generates program.md
codex -p "follow program.md"                 # any agent can follow it
gemini -p "follow program.md"
```
| | AutoML | autoimprove |
|---|---|---|
| Search space | Predefined grid | Open-ended |
| Changes | Numeric knobs | Rewrite code, try new algorithms |
| Strategy | Bayesian optimization | AI reasoning |
| Scope | ML hyperparameters | Any domain with a measurable score |
| Ceiling | Best from your grid | Unbounded |
The pattern works anywhere you have: a file to change, a command to run, and a number to improve.
```
/autoimprove   # that's it — setup is auto-guided
```