Here’s a simple, high‑leverage north‑star for AI‑coding work: merged PRs per agent‑hour — how many pull requests successfully land on main per hour of agent runtime.


Why this metric

  • Outcome‑centric: measures shipped code, not tokens, prompts, or “attempts.”
  • Comparable: normalizes across models, repos, and workflows with a single denominator (agent hours).
  • Anti‑vanity: ignores drafts, abandoned PRs, and reverts so the number reflects real value.

Exact definition

  • Numerator: Count of PRs merged to main in the window, excluding drafts, auto‑reverts, and PRs that are reverted within 7 days.
  • Denominator: Sum of wall‑clock runtime hours for all agents involved in those PRs (planning + coding + refactor + test + review assistance).
  • Metric: Merged PRs per agent‑hour = valid_merged_prs / agent_runtime_hours
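The formula is plain division; a minimal sketch in Python (the function name and the example numbers are illustrative, not from any real dataset):

```python
def merged_prs_per_agent_hour(valid_merged_prs: int, agent_runtime_hours: float) -> float:
    """Core metric: valid merged PRs divided by total agent runtime hours."""
    if agent_runtime_hours <= 0:
        raise ValueError("agent_runtime_hours must be positive")
    return valid_merged_prs / agent_runtime_hours

# e.g. 12 valid merged PRs over 30 agent-hours -> 0.4 PRs per agent-hour
print(merged_prs_per_agent_hour(12, 30.0))
```

The zero-check matters in practice: a window with merged PRs but no logged runs is a data-quality bug, not an infinite score.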

Guardrails (make it hard to game)

  • Only count PRs with: green CI, at least 1 human approval, and non‑trivial diff (e.g., ≥ 15 changed LOC or at least 2 files).
  • Exclude PRs labeled chore/deps unless they include passing tests written/updated by the agent.
  • Track a 7‑day revert rate alongside the metric. If revert rate > 5%, pause experiments.

Minimal implementation (7 steps)

  1. Tag agent runs: log run_id, start/stop timestamps, repo, branch, PR number.
  2. Join with VCS: from GitHub/GitLab API pull merged_at, is_draft, labels, CI status, approvals, reverts.
  3. Filter valid PRs: apply the rules above.
  4. Aggregate runtime: sum agent run hours per PR (if multiple runs touch one PR, sum them).
  5. Compute core metric daily + weekly; add p50/p90 time‑to‑merge and tests‑added per PR as secondary.
  6. Slice: by repo, model, prompt pack, retrieval config, and reviewer to find winners.
  7. Alerting: if metric drops >20% WoW or revert rate spikes, notify in Slack.
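Step 7's alert condition is simple enough to state precisely. A sketch, assuming you already compute the weekly metric and a 7-day revert rate upstream (the thresholds mirror the guardrails above):

```python
def should_alert(current: float, previous: float, revert_rate: float,
                 drop_threshold: float = 0.20, revert_threshold: float = 0.05) -> bool:
    """Fire when the metric drops more than 20% week-over-week
    or the 7-day revert rate exceeds 5%."""
    wow_drop = (previous - current) / previous if previous > 0 else 0.0
    return wow_drop > drop_threshold or revert_rate > revert_threshold
```

Wiring the return value to a Slack webhook is left to your infra; the point is that the trigger logic is two comparisons, so it belongs in code, not in a dashboard owner's head.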

Example SQL (BigQuery‑style, adapt as needed)

WITH agent_runs AS (
  SELECT pr_number, SUM(TIMESTAMP_DIFF(ended_at, started_at, MINUTE))/60.0 AS agent_hours
  FROM runs
  WHERE started_at >= @from AND ended_at < @to
  GROUP BY pr_number
),
valid_prs AS (
  SELECT p.pr_number
  FROM prs p
  LEFT JOIN reverts r ON r.reverted_pr = p.pr_number AND r.created_at BETWEEN p.merged_at AND TIMESTAMP_ADD(p.merged_at, INTERVAL 7 DAY)
  WHERE p.base_branch = 'main'
    AND p.is_draft = FALSE
    AND p.merged_at BETWEEN @from AND @to
    AND p.ci_status = 'success'
    AND p.approvals_count >= 1
    AND (p.changed_loc >= 15 OR p.changed_files >= 2)
    AND (r.reverted_pr IS NULL)
    -- Exact label match (substring LIKE on the joined array would also catch
    -- e.g. 'chore/deps-major'). The "agent added passing tests" exception
    -- from the guardrails needs an extra column or join to implement.
    AND NOT 'chore/deps' IN UNNEST(p.labels)
)
SELECT
  COUNT(v.pr_number) / NULLIF(SUM(a.agent_hours), 0) AS merged_prs_per_agent_hour,
  COUNT(v.pr_number) AS merged_prs,
  SUM(a.agent_hours) AS agent_hours
FROM valid_prs v
JOIN agent_runs a USING (pr_number);

Dashboard (keep it tiny)

  • Top‑line: Merged PRs / agent‑hour (7d, 28d)
  • Quality rails: Revert rate (7d), CI pass‑on‑first‑try %
  • Speed: p50/p90 time‑to‑merge
  • Breakdowns: by model → by repo → by task type
  • Leaderboard: prompt pack / tool config by uplift vs. baseline

How to use it weekly

  • Ship one change at a time (changing model and prompt together introduces too many confounders).
  • Hold a 20‑minute review: if a variant improves the metric and keeps revert rate low, promote it; otherwise roll back.
