Generalizing the agentic experiment loop
Inspired by Karpathy's autoresearch and Shopify Liquid PR #2056
Two projects independently discovered the same loop:

- Karpathy's autoresearch: the agent modifies train.py, trains for 5 min, checks val_bpb, keeps or discards. ~100 experiments overnight.
- Shopify Liquid PR #2056: the agent modifies Ruby source, runs tests + benchmarks, keeps or discards. ~120 experiments. Result: 53% faster.
The human writes the process. The agent writes the code.
The improve file is part config, part prompt: it describes what to change, how to measure, and when to stop.
```
# autoimprove: make-it-faster

## Change
scope: the checkout handler and its database queries
exclude: test/, vendor/

## Check
test: go test ./...
run: go test -bench=. -benchmem
score: ns/op:\s+([\d.]+)
goal: lower
guard: allocs/op: ([\d.]+) < 500
keep_if_equal: true
timeout: 3m

## Stop
budget: 4h
stale: 15

## Instructions
Reduce allocations in hot paths. Try buffer reuse, fast-path patterns.
```
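The loop this config drives can be sketched in a few lines. This is a hypothetical harness, not the skill's actual implementation; the function names are made up, but the semantics mirror the `score`, `goal`, and `keep_if_equal` fields above:

```python
import re

def extract_score(output: str, pattern: str):
    """Pull the score out of the run command's stdout (first capture group)."""
    m = re.search(pattern, output)
    return float(m.group(1)) if m else None

def keep(score: float, best: float, goal: str, keep_if_equal: bool = False) -> bool:
    """Decide whether an experiment's change is kept or discarded."""
    if score == best:
        return keep_if_equal
    return score < best if goal == "lower" else score > best

# Illustrative benchmark output shaped to match the config's regex:
out = "ns/op:  1432.5   allocs/op: 312"
extract_score(out, r"ns/op:\s+([\d.]+)")  # 1432.5
```

A kept experiment becomes the new baseline; a discarded one is reverted before the next attempt.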
You describe what to optimize. The agent resolves it to specific files.
```
## Change
scope: the template parsing engine
exclude: test/, benchmark/
```

```
Resolved scope "the template parsing engine" to:
- lib/liquid/parser.rb
- lib/liquid/lexer.rb
- lib/liquid/variable.rb

These are the ONLY files that will be modified. Confirm? [y/n]
```
- `exclude` prevents the agent from grading its own homework.
- `test` is the correctness gate: it must pass for any experiment to be kept. Generated by goal-aware bootstrap.
- `score` is the metric to optimize, extracted from stdout via convention, regex, or jq.
- `guard` lists secondary metrics that must not regress, so the agent can't improve speed by breaking reliability.
```
test: go test ./...                  # gate
run: go test -bench=. -benchmem      # produces the score output
score: ns/op:\s+([\d.]+)             # score
guard: allocs/op: ([\d.]+) < 500     # guard
```
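A guard line bundles an extraction regex with a threshold. A minimal sketch of how a harness might evaluate one; the "everything before the comparator is the regex" convention is an assumption, not the skill's documented format:

```python
import re

def check_guard(guard: str, output: str) -> bool:
    """Evaluate a guard like 'allocs/op: ([\\d.]+) < 500' against run output."""
    m = re.match(r"(.+?)\s*(<=|>=|<|>)\s*([\d.]+)$", guard)
    if m is None:
        raise ValueError(f"unparseable guard: {guard}")
    pattern, op, limit = m.group(1), m.group(2), float(m.group(3))
    hit = re.search(pattern, output)
    if hit is None:
        return False  # missing metric counts as a regression
    value = float(hit.group(1))
    return {"<": value < limit, ">": value > limit,
            "<=": value <= limit, ">=": value >= limit}[op]
```

A failed guard discards the experiment even when the primary score improved.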
The optimization goal predicts what the agent will break.
| Goal | Agent will try to... | Tests guard against... |
|---|---|---|
| Faster | Skip work, remove checks | Edge cases, unicode, nil, concurrency |
| Smaller | Remove things, swap deps | Features still work, runtime deps present |
| More accurate | Overfit, leak data | Data leakage, reproducibility, valid outputs |
| Better RAG | Game retrieval, stuff context | Format consistency, hallucination, empty results |
| Lower cost | Downsize, cut redundancy | Load handling, failover, durability |
```
/autoimprove bootstrap --generate
```
One command. The agent detects what's missing and walks you through it.
| Type | What it optimizes | Typical metric |
|---|---|---|
| perf | Code performance | ns/op, req/sec, allocations |
| ml | ML training | val_bpb, loss |
| automl | Tabular ML | AUC-ROC, F1 |
| rag | RAG pipeline | answer relevancy, faithfulness |
| docker | Container size | image bytes |
| k8s | Cluster health | running pod count |
| prompt | Prompt quality | F1, accuracy |
| sql | Query performance | execution time |
| frontend | Bundle size | bundle bytes |
| ci | Build speed | build time |
```
/autoimprove init --type rag
```
```
# autoimprove: faster-checkout-api

## Change
scope: the checkout handler and its database queries
exclude: test/, vendor/

## Check
test: go test ./...
run: hey -n 1000 http://localhost:8080/checkout
score: Requests/sec:\s+([\d.]+)
goal: higher
guard: latency_p99: ([\d.]+) < 500
timeout: 3m

## Stop
budget: 4h
stale: 15

## Instructions
Try query batching, connection pooling, response caching.
Don't change the API contract or add dependencies.
```
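The Stop section maps to a simple check run between experiments. A sketch under assumed names; `stale` counts consecutive experiments with no kept improvement:

```python
import time

def should_stop(started_at: float, since_last_improvement: int,
                budget_seconds: int = 4 * 3600, stale: int = 15) -> bool:
    """Stop when the time budget is spent or too many experiments in a
    row failed to improve the score (the 'stale' counter)."""
    if time.monotonic() - started_at >= budget_seconds:
        return True
    return since_last_improvement >= stale
```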
```
# autoimprove: better-rag-answers

## Change
scope: the RAG pipeline — chunking, retrieval, generation
exclude: data/, eval/

## Check
test: python -m pytest tests/test_pipeline.py -x
run: python eval/run_eval.py
score: answer_relevancy: ([\d.]+)
goal: higher
guard: error_rate: ([\d.]+) < 0.1
keep_if_equal: true
timeout: 5m

## Stop
budget: 6h
target: 0.92

## Instructions
Try: chunk size tuning, hybrid search, cross-encoder reranking,
query expansion, chain-of-thought generation.
```
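The score regex implies a contract on the eval script's stdout. A minimal sketch of the reporting end of a hypothetical `eval/run_eval.py`; the metric computation itself is a placeholder:

```python
def report(relevancy_scores: list, errors: int, total: int) -> str:
    """Emit the lines the Check regexes above expect to find on stdout."""
    lines = [
        f"answer_relevancy: {sum(relevancy_scores) / len(relevancy_scores):.3f}",
        f"error_rate: {errors / total:.3f}",
    ]
    return "\n".join(lines)

print(report([0.91, 0.84, 0.88], errors=1, total=20))
```

As long as the script keeps printing these two lines, the agent is free to rewrite everything upstream of them.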
```
# autoimprove: better-churn-model

## Change
scope: the training pipeline
exclude: data/, evaluate.py

## Check
test: python -m pytest tests/ -x
run: python train.py && python evaluate.py
score: auc_roc: ([\d.]+)
goal: higher
guard: f1_score: ([\d.]+) > 0.6
timeout: 3m

## Stop
budget: 4h
target: 0.95

## Instructions
Try: ratio features, rolling aggregates, target encoding,
XGBoost vs LightGBM vs CatBoost, model stacking.
```
Goes beyond AutoML: the agent can engineer features, rewrite preprocessing, and try novel model combinations.
Applied to a RAG search engine: hybrid search over 44K chunks, 301 documents, 20-query golden set.
| # | Experiment | Result |
|---|---|---|
| 1 | Fix keyword query sanitization (special chars crashed) | kept +0.35% |
| 2 | Adjust hybrid weights (0.4/0.6) + fetch limit | discarded |
| 3 | Replace weighted merge with Reciprocal Rank Fusion | kept +3.6% |
| 4 | Boost results appearing in both retrieval lists | discarded (0%) |
| 5 | Limit max 2 results per source for diversity | kept (equal) |
| 6 | Lower RRF constant (k=30) + 5x fetch | discarded -2.1% |
Round 2 applied the improved protocol (per-experiment commits, guards, keep_if_equal, supersedes).
| # | Experiment | Result |
|---|---|---|
| 7 | Boost results matching query in source name | discarded -4.8% |
| 8 | OR-mode keyword search (broader recall) | kept +1.7% |
| 9 | Fetch limit 5x | discarded |
| 10 | Better dedup key (120 chars vs 50) | kept (equal) |
| 11 | BM25 column weights (content=10x, name=2x) | kept +0.6% |
| 12 | Max 1 result per source | discarded |
| 13 | Query-text overlap boost after RRF | discarded -3.4% |
| 14 | Fetch limit 4x (supersedes #9) | kept +2.7% |
OR-mode keyword search (#8). Changed "all words must match" to "any word matches." A chunk about "growth loops" now surfaces for "how to run growth experiments" even without the word "experiments." RRF handles the noise: results matching both keyword AND semantic score highest.

BM25 column weights (#11). Keyword matches in content (weight 10) now score 5x higher than matches in metadata (weight 2). Previously a keyword hit in the "chunk_id" column scored the same as a hit in the actual text.

Fetch limit 4x (#14). With better keyword search (OR-mode + BM25 weights), more candidates are relevant. 4x fetch gives RRF a larger pool without drowning in noise. Guest hit rate went from 35% to 40%.

Reciprocal Rank Fusion (#3). Replaced weighted score merge with rank-based fusion. Avoids the normalization problem of comparing BM25 scores with cosine similarity. Single biggest improvement (+3.6%).
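Reciprocal Rank Fusion is small enough to sketch directly. The standard formulation scores each document by its summed reciprocal rank across lists; k=60 is the conventional constant (experiment #6 tried k=30):

```python
def rrf(keyword_ranked: list, semantic_ranked: list, k: int = 60) -> list:
    """Fuse two ranked lists by summed reciprocal rank: 1 / (k + rank).
    A document near the top of both lists outranks one that tops only one."""
    scores = {}
    for ranking in (keyword_ranked, semantic_ranked):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

rrf(["a", "b", "c"], ["b", "c", "d"])  # "b" wins: it ranks high in both lists
```

Because only ranks are used, raw BM25 and cosine-similarity scores never need to be put on the same scale.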
Building the golden set and eval script took longer than running 14 experiments. Most codebases don't have a measurable score out of the box.
Goal-aware bootstrap caught a crash on hyphenated queries before optimization started. The agent never would have found it by tuning scores.
Round 1 skipped per-experiment commits and lost rollback ability. Round 2 committed every experiment, giving a clean git reset on every discard.
Fetch 5x failed in round 1 but 4x succeeded in round 2 because OR-mode keyword search changed the quality of candidates. Context matters.
The skill runs in Claude Code. The protocol runs anywhere.
```
/autoimprove                                 # Claude Code (interactive)
claude -p "run /autoimprove on improve.md"   # headless overnight

/autoimprove --export                        # generates program.md
codex -p "follow program.md"                 # any agent can follow it
gemini -p "follow program.md"
```
| | AutoML | autoimprove |
|---|---|---|
| Search space | Predefined grid | Open-ended |
| Changes | Numeric knobs | Rewrite code, try new algorithms |
| Strategy | Bayesian optimization | AI reasoning |
| Scope | ML hyperparameters | Any domain with a measurable score |
| Ceiling | Best from your grid | Unbounded |
The pattern works anywhere you have: a file to change, a command to run, and a number to improve.
```
/autoimprove   # that's it — setup is auto-guided
```