Here’s a simple, high‑leverage north‑star for AI‑coding work: merged PRs per agent‑hour — how many pull requests successfully land on main per hour of agent runtime.


Why this metric

  • Outcome‑centric: measures shipped code, not tokens, prompts, or “attempts.”
  • Comparable: normalizes across models, repos, and workflows with a single denominator (agent hours).
  • Anti‑vanity: ignores drafts, abandoned PRs, and reverts so the number reflects real value.

Exact definition

  • Numerator: Count of PRs merged to main in the window, excluding drafts, auto‑reverts, and PRs that are reverted within 7 days.
  • Denominator: Sum of wall‑clock runtime hours for all agents involved in those PRs (planning + coding + refactor + test + review assistance).
  • Metric: Merged PRs per agent‑hour = valid_merged_prs / agent_runtime_hours
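In code, the definition reduces to one guarded division. A minimal sketch; the two inputs come from steps 1–4 of the implementation below:

```python
def merged_prs_per_agent_hour(valid_merged_prs: int, agent_runtime_hours: float) -> float:
    """Core metric: shipped PRs normalized by agent wall-clock time."""
    if agent_runtime_hours <= 0:
        raise ValueError("agent_runtime_hours must be positive")
    return valid_merged_prs / agent_runtime_hours
```

For example, 12 valid merged PRs over 8 agent-hours gives 1.5.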

Guardrails (make it hard to game)

  • Only count PRs with: green CI, at least 1 human approval, and non‑trivial diff (e.g., ≥ 15 changed LOC or at least 2 files).
  • Exclude PRs labeled chore/deps unless they include passing tests written/updated by the agent.
  • Track a 7‑day revert rate alongside the metric. If revert rate > 5%, pause experiments.
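The guardrails translate into a single validity predicate. A sketch, assuming the PR fields below are available from your VCS export (the field names are mine, not any particular API's):

```python
from dataclasses import dataclass, field

@dataclass
class PR:
    merged_to_main: bool
    is_draft: bool
    ci_green: bool
    human_approvals: int
    changed_loc: int
    changed_files: int
    labels: list = field(default_factory=list)
    agent_wrote_passing_tests: bool = False
    reverted_within_7d: bool = False

def is_valid_pr(pr: PR) -> bool:
    """Apply the guardrails: only clean, reviewed, non-trivial, unreverted PRs count."""
    if not (pr.merged_to_main and not pr.is_draft and pr.ci_green):
        return False
    if pr.reverted_within_7d:
        return False
    if pr.human_approvals < 1:
        return False
    # Non-trivial diff: >= 15 changed LOC or at least 2 files
    if pr.changed_loc < 15 and pr.changed_files < 2:
        return False
    # chore/deps PRs only count if the agent wrote/updated passing tests
    if ("chore" in pr.labels or "deps" in pr.labels) and not pr.agent_wrote_passing_tests:
        return False
    return True
```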

Minimal implementation (7 steps)

  1. Tag agent runs: log run_id, start/stop timestamps, repo, branch, PR number.
  2. Join with VCS: from GitHub/GitLab API pull merged_at, is_draft, labels, CI status, approvals, reverts.
  3. Filter valid PRs: apply the rules above.
  4. Aggregate runtime: sum agent run hours per PR (if multiple runs touch one PR, sum them).
  5. Compute core metric daily + weekly; add p50/p90 time‑to‑merge and tests‑added per PR as secondary.
  6. Slice: by repo, model, prompt pack, retrieval config, and reviewer to find winners.
  7. Alerting: if metric drops >20% WoW or revert rate spikes, notify in Slack.
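Step 7 is a two-condition check. A sketch with the thresholds from above (the parameter names are mine):

```python
def should_alert(metric_this_week: float, metric_last_week: float,
                 revert_rate: float,
                 max_wow_drop: float = 0.20, max_revert_rate: float = 0.05) -> bool:
    """Fire the Slack alert on a >20% week-over-week drop or a revert-rate spike."""
    wow_drop = 0.0
    if metric_last_week > 0:
        wow_drop = (metric_last_week - metric_this_week) / metric_last_week
    return wow_drop > max_wow_drop or revert_rate > max_revert_rate
```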

Example SQL (BigQuery‑style, adapt as needed)

WITH agent_runs AS (
  SELECT pr_number,
         SUM(TIMESTAMP_DIFF(ended_at, started_at, MINUTE)) / 60.0 AS agent_hours
  FROM runs
  WHERE started_at >= @from AND ended_at < @to
  GROUP BY pr_number
),
valid_prs AS (
  SELECT p.pr_number
  FROM prs p
  LEFT JOIN reverts r
    ON r.reverted_pr = p.pr_number
   AND r.created_at BETWEEN p.merged_at AND TIMESTAMP_ADD(p.merged_at, INTERVAL 7 DAY)
  WHERE p.base_branch = 'main'
    AND p.is_draft = FALSE
    AND p.merged_at BETWEEN @from AND @to
    AND p.ci_status = 'success'
    AND p.approvals_count >= 1
    -- Non-trivial diff: >= 15 changed LOC or at least 2 files
    AND (p.changed_loc >= 15 OR p.changed_files >= 2)
    AND r.reverted_pr IS NULL  -- not reverted within 7 days
    -- Exclude chore/deps PRs; the "unless the agent wrote passing tests"
    -- exception needs an extra column or join, omitted here for brevity
    AND NOT EXISTS (
      SELECT 1 FROM UNNEST(p.labels) AS label
      WHERE label IN ('chore', 'deps')
    )
)
SELECT
  COUNT(v.pr_number) / NULLIF(SUM(a.agent_hours), 0) AS merged_prs_per_agent_hour,
  COUNT(v.pr_number) AS merged_prs,
  SUM(a.agent_hours) AS agent_hours
FROM valid_prs v
JOIN agent_runs a USING (pr_number);

Dashboard (keep it tiny)

  • Top‑line: Merged PRs / agent‑hour (7d, 28d)
  • Quality rails: Revert rate (7d), CI pass‑on‑first‑try %
  • Speed: p50/p90 time‑to‑merge
  • Breakdowns: by model → by repo → by task type
  • Leaderboard: prompt pack / tool config by uplift vs. baseline

How to use it weekly

  • Change one variable at a time (model + prompt together = too many confounders).
  • Hold a 20‑minute review: if a variant improves the metric and keeps the revert rate low, promote it; otherwise roll back.
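That weekly promote-or-rollback rule fits in one function. A sketch; the revert-rate threshold mirrors the 5% guardrail above:

```python
def weekly_decision(variant_metric: float, baseline_metric: float,
                    variant_revert_rate: float,
                    max_revert_rate: float = 0.05) -> str:
    """Promote a variant only if it beats baseline without hurting quality."""
    if variant_metric > baseline_metric and variant_revert_rate <= max_revert_rate:
        return "promote"
    return "rollback"
```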
