Here’s a simple, high‑leverage north‑star for AI‑coding work: merged PRs per agent‑hour — how many pull requests successfully land on main per hour of agent runtime.


Why this metric

  • Outcome‑centric: measures shipped code, not tokens, prompts, or “attempts.”
  • Comparable: normalizes across models, repos, and workflows with a single denominator (agent hours).
  • Anti‑vanity: ignores drafts, abandoned PRs, and reverts so the number reflects real value.

Exact definition

  • Numerator: Count of PRs merged to main in the window, excluding drafts, auto‑reverts, and PRs that are reverted within 7 days.
  • Denominator: Sum of wall‑clock runtime hours for all agents involved in those PRs (planning + coding + refactor + test + review assistance).
  • Metric: Merged PRs per agent‑hour = valid_merged_prs / agent_runtime_hours
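A quick worked example with made-up numbers, just to fix the units:

```python
# Hypothetical window: 12 valid merged PRs, 30 summed agent runtime hours.
valid_merged_prs = 12
agent_runtime_hours = 30.0

merged_prs_per_agent_hour = valid_merged_prs / agent_runtime_hours
print(merged_prs_per_agent_hour)  # 0.4 merged PRs per agent-hour
```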

Guardrails (make it hard to game)

  • Only count PRs with: green CI, at least 1 human approval, and non‑trivial diff (e.g., ≥ 15 changed LOC or at least 2 files).
  • Exclude PRs labeled chore/deps unless they include passing tests written/updated by the agent.
  • Track a 7‑day revert rate alongside the metric. If revert rate > 5%, pause experiments.
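These guardrails can be sketched as a single validity filter. The PR field names below (`ci_green`, `approvals`, `changed_loc`, `changed_files`, `labels`, `reverted_within_7d`, `agent_wrote_tests`) are illustrative, not a fixed schema:

```python
def is_valid_pr(pr: dict) -> bool:
    """Apply the guardrails: green CI, >= 1 human approval, non-trivial
    diff, not reverted within 7 days, and no unqualified chore/deps PRs.
    Field names are illustrative."""
    if not pr["ci_green"] or pr["approvals"] < 1:
        return False
    if pr["reverted_within_7d"]:
        return False
    # Non-trivial diff: >= 15 changed LOC or at least 2 files.
    if pr["changed_loc"] < 15 and pr["changed_files"] < 2:
        return False
    # chore/deps PRs only count if the agent wrote/updated passing tests.
    if {"chore", "deps"} & set(pr["labels"]) and not pr["agent_wrote_tests"]:
        return False
    return True
```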

Minimal implementation (7 steps)

  1. Tag agent runs: log run_id, start/stop timestamps, repo, branch, PR number.
  2. Join with VCS: from GitHub/GitLab API pull merged_at, is_draft, labels, CI status, approvals, reverts.
  3. Filter valid PRs: apply the rules above.
  4. Aggregate runtime: sum agent run hours per PR (if multiple runs touch one PR, sum them).
  5. Compute core metric daily + weekly; add p50/p90 time‑to‑merge and tests‑added per PR as secondary.
  6. Slice: by repo, model, prompt pack, retrieval config, and reviewer to find winners.
  7. Alerting: if metric drops >20% WoW or revert rate spikes, notify in Slack.
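Steps 4–5 reduce to a few lines once runs are tagged with their PR numbers. A minimal sketch, assuming run records are already joined to PRs (the tuple shape is illustrative):

```python
from collections import defaultdict

def merged_prs_per_agent_hour(runs, valid_pr_numbers):
    """runs: iterable of (pr_number, hours) for agent runs in the window.
    valid_pr_numbers: set of PR numbers that passed the guardrail filter."""
    hours_by_pr = defaultdict(float)
    for pr_number, hours in runs:
        hours_by_pr[pr_number] += hours  # step 4: runs on the same PR sum up

    # Step 5: core metric over valid PRs only.
    total_hours = sum(hours_by_pr[pr] for pr in valid_pr_numbers)
    return len(valid_pr_numbers) / total_hours if total_hours else 0.0

# Two runs on PR 101 (1.5h + 0.5h) and one on PR 102 (2.0h): 2 PRs / 4h
print(merged_prs_per_agent_hour([(101, 1.5), (101, 0.5), (102, 2.0)],
                                {101, 102}))  # 0.5
```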

Example SQL (BigQuery‑style, adapt as needed)

WITH agent_runs AS (
  -- Step 4: sum agent run hours per PR (multiple runs on one PR add up)
  SELECT pr_number, SUM(TIMESTAMP_DIFF(ended_at, started_at, MINUTE))/60.0 AS agent_hours
  FROM runs
  WHERE started_at >= @from AND ended_at < @to
  GROUP BY pr_number
),
valid_prs AS (
  -- Step 3: keep only PRs that pass the guardrails
  SELECT p.pr_number
  FROM prs p
  LEFT JOIN reverts r
    ON r.reverted_pr = p.pr_number
   AND r.created_at BETWEEN p.merged_at AND TIMESTAMP_ADD(p.merged_at, INTERVAL 7 DAY)
  WHERE p.base_branch = 'main'
    AND p.is_draft = FALSE
    AND p.merged_at BETWEEN @from AND @to
    AND p.ci_status = 'success'
    AND p.approvals_count >= 1
    AND (p.changed_loc >= 15 OR p.changed_files >= 2)  -- non-trivial diff
    AND r.reverted_pr IS NULL                          -- not reverted within 7 days
    AND NOT EXISTS (SELECT 1 FROM UNNEST(p.labels) AS label
                    WHERE label IN ('chore', 'deps'))
)
SELECT
  COUNT(v.pr_number) / NULLIF(SUM(a.agent_hours), 0) AS merged_prs_per_agent_hour,
  COUNT(v.pr_number) AS merged_prs,
  SUM(a.agent_hours) AS agent_hours
FROM valid_prs v
JOIN agent_runs a USING (pr_number);
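The alerting rule from step 7 can be sketched as a pure check over weekly aggregates (thresholds come from the guardrails above; the Slack notification itself is omitted):

```python
def should_alert(metric_this_week, metric_last_week, revert_rate,
                 max_drop=0.20, max_revert_rate=0.05):
    """Flag a > 20% week-over-week drop in merged PRs per agent-hour,
    or a 7-day revert rate above 5%."""
    wow_drop = 0.0
    if metric_last_week > 0:
        wow_drop = (metric_last_week - metric_this_week) / metric_last_week
    return wow_drop > max_drop or revert_rate > max_revert_rate
```

A `True` result is the trigger to pause experiments and notify the channel.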

Dashboard (keep it tiny)

  • Top‑line: Merged PRs / agent‑hour (7d, 28d)
  • Quality rails: Revert rate (7d), CI pass‑on‑first‑try %
  • Speed: p50/p90 time‑to‑merge
  • Breakdowns: by model → by repo → by task type
  • Leaderboard: prompt pack / tool config by uplift vs. baseline

How to use it weekly

  • Ship one change at a time (changing model + prompt together introduces too many confounders).
  • Hold a 20‑minute review: if a variant improves the metric and keeps revert rate low, promote it; otherwise roll back.
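The promote/roll-back decision can be written down as a rule, assuming you track each variant's metric and revert rate against a baseline (thresholds are illustrative):

```python
def promote_variant(variant_metric, baseline_metric, variant_revert_rate,
                    max_revert_rate=0.05):
    """Promote only if the variant beats the baseline on merged PRs per
    agent-hour while keeping the 7-day revert rate within the guardrail."""
    return (variant_metric > baseline_metric
            and variant_revert_rate <= max_revert_rate)
```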
