Here’s a simple, high‑leverage north‑star for AI‑coding work: merged PRs per agent‑hour — how many pull requests successfully land on main per hour of agent runtime.
Why this metric
- Outcome‑centric: measures shipped code, not tokens, prompts, or “attempts.”
- Comparable: normalizes across models, repos, and workflows with a single denominator (agent hours).
- Anti‑vanity: ignores drafts, abandoned PRs, and reverts so the number reflects real value.
Exact definition
- Numerator: Count of PRs merged to main in the window, excluding drafts, auto‑reverts, and PRs that are reverted within 7 days.
- Denominator: Sum of wall‑clock runtime hours for all agents involved in those PRs (planning + coding + refactor + test + review assistance).
- Metric: merged PRs per agent‑hour = valid_merged_prs / agent_runtime_hours
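For example, 12 valid merged PRs against 30 hours of total agent runtime works out to 12 / 30 = 0.4 merged PRs per agent‑hour.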
Guardrails (make it hard to game)
- Only count PRs with green CI, at least one human approval, and a non‑trivial diff (e.g., ≥ 15 changed LOC or at least 2 files).
- Exclude PRs labeled chore/deps unless they include passing tests written/updated by the agent.
- Track a 7‑day revert rate alongside the metric (a query sketch follows this list); if the revert rate exceeds 5%, pause experiments.
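A minimal sketch of that revert‑rate rail in the same BigQuery style, assuming the hypothetical prs and reverts tables used by the example query further below (each reverts row links reverted_pr back to the PR it undid):

-- Share of PRs merged to main that were reverted within 7 days of merging.
SELECT
  SAFE_DIVIDE(COUNT(DISTINCT r.reverted_pr), COUNT(DISTINCT p.pr_number)) AS revert_rate_7d
FROM prs p
LEFT JOIN reverts r
  ON r.reverted_pr = p.pr_number
 AND r.created_at BETWEEN p.merged_at AND TIMESTAMP_ADD(p.merged_at, INTERVAL 7 DAY)
WHERE p.base_branch = 'main'
  AND p.merged_at BETWEEN @from AND @to;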
Minimal implementation (7 steps)
- Tag agent runs: log run_id, start/stop timestamps, repo, branch, PR number.
- Join with VCS: from the GitHub/GitLab API, pull merged_at, is_draft, labels, CI status, approvals, and reverts (a schema sketch follows this list).
- Filter valid PRs: apply the rules above.
- Aggregate runtime: sum agent run hours per PR (if multiple runs touch one PR, sum them).
- Compute core metric daily + weekly; add p50/p90 time‑to‑merge and tests‑added per PR as secondary.
- Slice: by repo, model, prompt pack, retrieval config, and reviewer to find winners.
- Alerting: if the metric drops >20% week over week or the revert rate spikes, notify in Slack (a sketch of the check follows the example query).
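Steps 1 and 2 boil down to three tables. Here is a sketch of schemas the example query below can run against, written as BigQuery DDL; the table and column names are illustrative (and assume a default dataset), so adapt them to your warehouse:

-- One row per agent run, written by the agent harness (step 1).
CREATE TABLE IF NOT EXISTS runs (
  run_id     STRING,
  repo       STRING,
  branch     STRING,
  pr_number  INT64,        -- PR the run contributed to, if any
  started_at TIMESTAMP,
  ended_at   TIMESTAMP
);

-- One row per PR, refreshed from the GitHub/GitLab API (step 2).
CREATE TABLE IF NOT EXISTS prs (
  pr_number       INT64,
  base_branch     STRING,
  is_draft        BOOL,
  created_at      TIMESTAMP,   -- used for time-to-merge
  merged_at       TIMESTAMP,
  ci_status       STRING,      -- e.g. 'success'
  approvals_count INT64,
  changed_loc     INT64,
  labels          ARRAY<STRING>
);

-- One row per revert, linking the reverting change back to the PR it undid.
CREATE TABLE IF NOT EXISTS reverts (
  reverted_pr INT64,
  created_at  TIMESTAMP
);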
Example SQL (BigQuery‑style, adapt as needed)
WITH agent_runs AS (
  -- Total agent runtime per PR, in hours; multiple runs touching one PR are summed.
  SELECT
    pr_number,
    SUM(TIMESTAMP_DIFF(ended_at, started_at, MINUTE)) / 60.0 AS agent_hours
  FROM runs
  WHERE started_at >= @from AND ended_at < @to
  GROUP BY pr_number
),
valid_prs AS (
  -- PRs merged to main in the window that pass the guardrails above.
  SELECT p.pr_number
  FROM prs p
  LEFT JOIN reverts r
    ON r.reverted_pr = p.pr_number
   AND r.created_at BETWEEN p.merged_at AND TIMESTAMP_ADD(p.merged_at, INTERVAL 7 DAY)
  WHERE p.base_branch = 'main'
    AND p.is_draft = FALSE
    AND p.merged_at BETWEEN @from AND @to
    AND p.ci_status = 'success'
    AND p.approvals_count >= 1
    AND p.changed_loc >= 15          -- extend with a changed-files check if you track one
    AND r.reverted_pr IS NULL        -- not reverted within 7 days
    AND ARRAY_TO_STRING(p.labels, ',') NOT LIKE '%chore/deps%'
)
SELECT
  COUNT(v.pr_number) / NULLIF(SUM(a.agent_hours), 0) AS merged_prs_per_agent_hour,
  COUNT(v.pr_number) AS merged_prs,
  SUM(a.agent_hours) AS agent_hours
FROM valid_prs v
JOIN agent_runs a USING (pr_number);
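Note that the final JOIN is an inner join: merged PRs that no logged agent run touched neither inflate the numerator nor add hours to the denominator.

For step 7, a sketch of the week‑over‑week check against the same tables; the valid‑PR filters are abbreviated here, and the Slack notification itself would be wired up outside SQL (e.g., a scheduled query feeding a webhook):

WITH weekly AS (
  SELECT
    DATE_TRUNC(DATE(p.merged_at), WEEK) AS week,
    COUNT(DISTINCT p.pr_number) AS merged_prs,
    SUM(TIMESTAMP_DIFF(ar.ended_at, ar.started_at, MINUTE)) / 60.0 AS agent_hours
  FROM prs p
  JOIN runs ar USING (pr_number)
  WHERE p.base_branch = 'main'
    AND p.merged_at IS NOT NULL   -- apply the full valid-PR filters here in practice
  GROUP BY week
),
trend AS (
  SELECT
    week,
    SAFE_DIVIDE(merged_prs, agent_hours) AS metric,
    LAG(SAFE_DIVIDE(merged_prs, agent_hours)) OVER (ORDER BY week) AS prev_metric
  FROM weekly
)
SELECT
  week,
  metric,
  prev_metric,
  SAFE_DIVIDE(metric, prev_metric) < 0.8 AS should_alert   -- flags a >20% week-over-week drop
FROM trend
ORDER BY week;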
Dashboard (keep it tiny)
- Top‑line: Merged PRs / agent‑hour (7d, 28d)
- Quality rails: Revert rate (7d), CI pass‑on‑first‑try %
- Speed: p50/p90 time‑to‑merge (query sketch after this list)
- Breakdowns: by model → by repo → by task type
- Leaderboard: prompt pack / tool config by uplift vs. baseline
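A sketch of the speed rail, assuming prs carries the created_at timestamp from the schema sketch above:

SELECT
  APPROX_QUANTILES(TIMESTAMP_DIFF(merged_at, created_at, HOUR), 100)[OFFSET(50)] AS p50_hours_to_merge,
  APPROX_QUANTILES(TIMESTAMP_DIFF(merged_at, created_at, HOUR), 100)[OFFSET(90)] AS p90_hours_to_merge
FROM prs
WHERE base_branch = 'main'
  AND merged_at BETWEEN @from AND @to;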
How to use it weekly
- Test one change at a time (switching the model and the prompt together means too many confounders).
- Hold a 20‑minute review: if a variant improves the metric and keeps revert rate low, promote it; otherwise roll back.