A tour of the metrics that describe how software teams actually perform — grouped by what they measure: delivery, reliability, productivity, frontend performance, quality, and agile process. What each one means, when it's useful, and how it goes wrong.
Why metrics (and why they mislead)
A metric is a signal, not a target. The moment you turn a signal into a goal people are rewarded for, they optimize the number — and the number stops describing reality. This is Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." Measure developers by lines of code and you get bloated code; measure by tickets closed and tickets get smaller and more numerous. The metric goes up while the thing you actually cared about stays flat or gets worse.
So the first distinction to keep straight is health metrics vs vanity metrics. A vanity metric looks good in a slide and moves the way you want without telling you anything actionable (total registered users, cumulative commits). A health metric changes your decisions: if it moves, you do something different tomorrow.
A few cross-cutting ideas show up in every category below:
| Concept | What it means | Example |
|---|---|---|
| Leading indicator | Predicts a future outcome; you can still act on it | PR review latency, error-budget burn rate |
| Lagging indicator | Confirms an outcome after the fact | Quarterly churn, escaped-defect count |
| Guardrail / counter-metric | A paired metric that goes bad if you game the primary one | Pair deploy frequency with change failure rate |
| North Star metric | The single number that best proxies durable customer value | "Weekly active teams", "nights booked" |
The guardrail is the concrete antidote to Goodhart: you never track a metric alone, you track it against its natural counterweight. Push deployment frequency up and change failure rate is supposed to stay flat — if it climbs, you're shipping faster by shipping worse. Push code coverage up and mutation score should climb with it — if it doesn't, you're writing tests that execute code without asserting anything. The pairing is what makes the number honest.
Keep that frame for the rest of the tour: every metric below is useful as a signal, and almost every one is dangerous as a target.
Delivery & DevOps (DORA)
The DORA metrics (from Google's DevOps Research and Assessment program, popularized by the book Accelerate) are the most battle-tested set of delivery metrics. They split into two throughput measures and two stability measures — and the research's key finding is that the best teams are elite at both, killing the old myth that speed and stability trade off against each other.
| Metric | Question it answers | Type |
|---|---|---|
| Deployment frequency | How often do we ship to production? | Throughput |
| Lead time for changes | Commit → running in production, how long? | Throughput |
| Change failure rate (CFR) | What % of deploys cause a failure needing remediation? | Stability |
| Time to restore service (MTTR) | How fast do we recover from a failure? | Stability |
The rough performance bands DORA reports look like this (they drift year to year, so treat them as orders of magnitude, not exact cutoffs):
| Band | Deploy frequency | Lead time | Change failure rate | Time to restore |
|---|---|---|---|---|
| Elite | On-demand (multiple/day) | < 1 day | 0–15% | < 1 hour |
| High | Daily–weekly | 1 day–1 week | 16–30% | < 1 day |
| Medium | Weekly–monthly | 1 week–1 month | 16–30% | 1 day–1 week |
| Low | Monthly–biannually | 1–6 months | 16–30%+ | 1 week–1 month |
The throughput pair (deploy frequency, lead time) and the stability pair (CFR, MTTR) are each other's guardrails: chasing frequency while CFR climbs means you're just shipping bugs faster.
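As a rough illustration of how the four keys could be derived from raw deploy records, here is a minimal sketch. The `Deploy` record and its field names are hypothetical, not taken from any particular tool, and time to restore is simplified to "deploy time to recovery time" rather than measured from when the failure was detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """Hypothetical record of one production deploy (field names are assumptions)."""
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when it was running in production
    failed: bool = False                    # did it need remediation?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Compute the four keys over a window, using medians to resist outliers."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deploy and a positive window")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    # Simplification: restore time is measured from the failing deploy itself.
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restores) if restores else None,
    }


# Made-up sample: four deploys over two days, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
example = [
    Deploy(t0, t0 + timedelta(hours=2)),
    Deploy(t0, t0 + timedelta(hours=4)),
    Deploy(t0, t0 + timedelta(hours=6), failed=True,
           restored_at=t0 + timedelta(hours=6, minutes=30)),
    Deploy(t0, t0 + timedelta(hours=8)),
]
summary = dora_summary(example, window_days=2)
```

Reporting all four values together, rather than deploy frequency alone, is what keeps the throughput and stability pairs honest against each other.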
Underneath DORA sits CI/CD pipeline health. Lead time is only as good as the pipeline that produces it, so the delivery machine has its own metrics:
| Pipeline metric | Why it matters |
|---|---|
| Build duration | Long builds lengthen lead time and kill flow; the feedback loop should be minutes, not hours |
| Pipeline success rate | A green-main default; chronic red on main blocks everyone |
| Queue / wait time | Time a job waits for a runner — invisible drag that inflates lead time |
| Flaky-test rate | % of failures that pass on re-run; flakiness trains engineers to ignore red, which is how real failures ship |
| MTTR for red builds | How fast a broken main gets back to green — a broken pipeline is an outage for the whole team |
Reliability: SLIs, SLOs & SLAs
Reliability metrics come in three layers that people constantly conflate:
| Term | Full name | What it is | Audience |
|---|---|---|---|
| SLI | Service Level Indicator | The actual measurement, e.g. "% of requests < 300 ms" | Engineers |
| SLO | Service Level Objective | Your internal target for that SLI, e.g. "99.9% over 30 days" | Engineering + product |
| SLA | Service Level Agreement | A contractual promise to customers, with penalties if breached | Legal / customers |
The relationship is nested: you measure an SLI, hold yourself to an SLO that is stricter than any SLA you sign, so you have room to react before you owe a refund.
Error budgets turn the SLO into a spending account. If your SLO is 99.9%, you're allowed 0.1% unreliability, and that allowance is the budget. You can spend it slowly on background noise or all at once on a risky launch; that's your call. When the budget is healthy, ship features fast; when it's exhausted, freeze risky changes and pay down reliability. This is the mechanism that lets speed and stability coexist.
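The budget arithmetic can be sketched for a request-based SLI as follows. The function names are illustrative, and the burn-rate definition used here (observed error rate divided by the allowed error rate) is one common convention, stated as an assumption rather than the only way to define it.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Requests allowed to fail in the window without breaching the SLO."""
    return (1.0 - slo) * total_requests


def budget_consumed(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget already spent; above 1.0 means the SLO is breached."""
    return failed_requests / error_budget(slo, total_requests)


def burn_rate(slo: float, observed_error_rate: float) -> float:
    """1.0 means spending the budget exactly as fast as the SLO allows."""
    return observed_error_rate / (1.0 - slo)


# A 99.9% SLO over 1,000,000 requests leaves roughly 1,000 failures to spend.
budget = error_budget(0.999, 1_000_000)
spent = budget_consumed(0.999, 1_000_000, failed_requests=250)
# A sustained 1% error rate spends the budget about ten times too fast.
rate = burn_rate(0.999, observed_error_rate=0.01)
```

A freeze policy then reduces to a simple check: keep shipping while `spent` is comfortably below 1.0, and stop risky changes once it approaches or exceeds it.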
Availability math (the "nines") is worth memorizing because the intuition is non-linear: each additional nine cuts the allowed downtime tenfold and is correspondingly harder to achieve:
| Availability | "Nines" | Downtime / year | Downtime / month |
|---|---|---|---|
| 99% | two nines | ~3.65 days | ~7.2 hours |
| 99.9% | three nines | ~8.77 hours | ~43 minutes |
| 99.99% | four nines | ~52.6 minutes | ~4.3 minutes |
| 99.999% | five nines | ~5.26 minutes | ~26 seconds |
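The figures in the table follow from one line of arithmetic. A small sketch, assuming a 365.25-day year and a 30-day month (the table's rounding is consistent with those choices):

```python
def allowed_downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Seconds of downtime permitted in a period at the given availability."""
    return (1 - availability_pct / 100) * period_days * 86_400


# Print the yearly and monthly allowance for each level of "nines".
for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_seconds(pct, 365.25)
    per_month = allowed_downtime_seconds(pct, 30)
    print(f"{pct}%: {per_year / 3600:.2f} h/year, {per_month / 60:.2f} min/month")
```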
Latency is measured in percentiles, never averages. An average hides the tail: p50 (the median) tells you the typical experience, while p95 and p99 tell you what your unhappiest users feel. A p99 of 2s means 1 in 100 requests takes 2s or longer, and on a page that makes 100 such requests, roughly 63% of page loads (1 − 0.99^100) hit at least one of them.
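To make the "average hides the tail" point concrete, here is a small sketch using made-up latency samples. It uses the nearest-rank percentile definition (one of several in common use) and assumes requests on a page are independent, which real traffic only approximates.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def chance_of_slow_page(requests_per_page: int, tail_fraction: float = 0.01) -> float:
    """Probability that at least one request on a page lands in the slow tail."""
    return 1 - (1 - tail_fraction) ** requests_per_page


# Made-up data: 98 fast requests and 2 very slow ones.
latencies_ms = [100.0] * 98 + [2000.0] * 2

avg = mean(latencies_ms)            # looks healthy, hides the tail
p50 = percentile(latencies_ms, 50)  # the typical request
p99 = percentile(latencies_ms, 99)  # what the unluckiest requests see
slow_page = chance_of_slow_page(100)
```

The mean stays close to the fast requests while p99 exposes the 2-second outliers, and the tail probability shows why a rare slow request becomes a common slow page.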
Underneath the SLx layer are the raw signals you actually instrument. Three well-known frameworks, each suited to a different thing:
| Framework | Signals | Best for |
|---|---|---|
| Four Golden Signals (Google SRE) | Latency, Traffic, Errors, Saturation | Any user-facing service |
| RED | Rate, Errors, Duration | Request-driven microservices |
| USE | Utilization, Saturation, Errors | Resources: CPU, disk, memory, queues |
RED is the request-centric view (what your users hit); USE is the resource-centric view (what your machines feel); the Golden Signals are essentially RED plus saturation. Most teams use RED for services and USE for infrastructure.
Finally, alerting has its own quality metrics — because an alert that no one trusts is worse than no alert:
- MTTA (mean time to acknowledge) — how long from alert firing to a human owning it.
- Signal-to-noise ratio — % of alerts that were actionable vs pure noise. Chronic false positives cause alert fatigue, where real pages get ignored. A good target is that nearly every page a human wakes up for was worth waking up for.
Productivity & flow
"Developer productivity" is where measurement goes to die, because the tempting metrics are the worst ones. Lines of code, commits per developer, and story points per person are traps — they measure motion, not value. LOC rewards verbosity (the best change is often a deletion); commit count rewards splitting work into confetti; per-person output punishes the collaboration, mentoring, and review that make a team fast. All three fail Goodhart instantly and all three penalize senior engineers, whose highest-value work often produces the least code.
The credible answer is the SPACE framework (GitHub / Microsoft Research), whose core message is that productivity is multi-dimensional and that you should track at least one metric from each of several dimensions rather than optimizing any single one:
| SPACE dimension | Captures | Example metric |
|---|---|---|
| Satisfaction & well-being | Are devs happy and healthy? | eNPS, burnout survey |
| Performance | Outcome quality, not output | Change failure rate, reliability |
| Activity | Volume of actions (use with care) | PRs, deploys, design docs |
| Communication & collaboration | How work flows between people | Review latency, discoverability of docs |
| Efficiency & flow | Ability to work with minimal interruption | Cycle time, handoff count |
For the day-to-day, flow metrics are the most actionable subset:
- Cycle time — first commit → in production for a unit of work. The single best flow signal; short cycle time means fast feedback and small batches.
- Work in progress (WIP) — how many things are in flight at once. High WIP means lots of context-switching and stalled, half-done work; limiting WIP is the fastest way to cut cycle time (Little's Law: cycle time = WIP ÷ throughput).
- Throughput — items completed per unit time. Useful as a trend, dangerous as a target (it's one keystroke from "count the tickets").
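Little's Law from the WIP bullet above can be checked with a couple of lines. This is a sketch with illustrative numbers; the law describes long-run averages for a reasonably stable system, so treat the result as an average, not a promise for any single ticket.

```python
def average_cycle_time(wip: float, throughput_per_day: float) -> float:
    """Little's Law: average cycle time (days) = WIP / throughput (items per day)."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput_per_day


# Same team, same throughput of 2 items/day: halving WIP halves cycle time.
before = average_cycle_time(wip=12, throughput_per_day=2)
after = average_cycle_time(wip=6, throughput_per_day=2)
```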
The honest framing: measure flow and outcomes at the team level, never rank individuals by output. As soon as a productivity metric has a name attached to it and a bonus riding on it, it stops measuring productivity.
Developer experience & team health
Where the previous section measured output and flow, this one measures the humans and the environment producing them. DevEx (Developer Experience) is the successor framing to SPACE: several of the same authors, a sharper lens. It argues that developer effectiveness is best understood through three dimensions:
| DevEx dimension | The question | Signals |
|---|---|---|
| Feedback loops | How fast do I learn if my work is good? | Build/test time, CI wait, review latency, deploy time |
| Cognitive load | How hard is it to get things done? | Onboarding time, doc quality, incidental complexity |
| Flow state | Can I focus without interruption? | Uninterrupted focus time, meeting load, context switches |
Around those sit the "human" metrics teams routinely forget to measure — they don't show up in a delivery dashboard, but they predict whether the delivery dashboard will still look good in a year:
| Metric | What it tells you | Watch for |
|---|---|---|
| On-call load & toil | Manual, repetitive, automatable work as % of time | > ~50% toil means no time to improve the system |
| Attrition / retention | Are people leaving? Regretted vs non-regretted | Regretted attrition is a leading indicator of deeper rot |
| Onboarding / ramp time | Time to first PR, time to productive | Long ramp = high cognitive load and poor docs |
| Bus factor | How many people can leave before knowledge is lost | A bus factor of 1 on a critical system is a live risk |
| Burnout & psychological safety | Sustainable pace; safety to speak up and fail | Measured via surveys (eNPS, safety index), not telemetry |
Two cautions specific to this category. First, most of these are survey-based, and that's fine — self-reported experience is a legitimate, well-validated measurement; you don't need everything to come from a log. Second, these are the metrics most vulnerable to "we measured it once for a QBR and never acted." Their whole value is as a leading indicator: burnout and a bus factor of 1 are cheap to see and expensive to ignore.
Frontend performance
For anything users load in a browser, performance is measured mostly through Google's Core Web Vitals (LCP, INP, CLS) — plus the supporting signals (TTFB, bundle size), the lab-vs-field-data distinction, client-side reliability, and accessibility that sit alongside them. It's a big enough topic to get its own deep-dive.
→ See the dedicated post: Measuring the Frontend: Metrics That Actually Matter — Core Web Vitals, lab vs field data (Lighthouse / RUM / CrUX), TTFB and bundle size, client-side reliability, accessibility (a11y), the SEO overlap, and the frontend-adjacent product signals.
Quality & bugs
Quality metrics try to answer "how much is broken, and are we finding it before customers do?" The most important is where defects are caught:
| Metric | Definition | Good direction |
|---|---|---|
| Defect escape rate | Defects found in production as a % of all defects found (production plus pre-release) | Lower: you want them caught in test, not by users |
| Escaped-defect density | Production defects per unit (KLOC, feature, release) | Lower and stable over releases |
| MTTR (bugs) | Mean time to resolve a defect once reported | Lower |
| Defect reopen rate | % of "fixed" bugs that come back | Lower — high reopen = superficial fixes |
Escape rate is the headline: a team can have many bugs but a low escape rate (great test net) or few reported bugs but a high escape rate (users are your QA). The second is far more dangerous.
Code coverage deserves its own warning. Coverage measures which lines ran during tests — not whether anything was asserted about them. You can have 100% coverage and zero real verification. Treat it as a floor, not a target:
- Chasing a high coverage number produces tests that call code and assert nothing.
- Mutation testing is the honest guardrail: it deliberately introduces bugs and checks whether your tests catch them. A high mutation score means your tests actually assert behavior; pair it with coverage so coverage can't be gamed.
- ~70–80% meaningful coverage on the code that matters beats 95% of everything.
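The gap between coverage and verification can be shown with a hand-rolled example. This is only a sketch of the idea behind mutation testing: the function, the single hand-written mutant, and the helper names are all hypothetical, whereas real mutation tools generate many mutants automatically.

```python
def apply_discount(price: float, percent: float) -> float:
    """Code under test."""
    return price * (1 - percent / 100)


def mutant_apply_discount(price: float, percent: float) -> float:
    """Hand-written mutant: the minus sign has been flipped to a plus."""
    return price * (1 + percent / 100)


def weak_test(fn) -> bool:
    """Executes every line of fn (100% coverage) but asserts nothing."""
    fn(100, 10)
    return True


def strong_test(fn) -> bool:
    """Same coverage, but actually checks the result."""
    return abs(fn(100, 10) - 90.0) < 1e-9


def mutant_killed(test) -> bool:
    """A test kills the mutant if it passes on the original and fails on the mutant."""
    return test(apply_discount) and not test(mutant_apply_discount)


weak_result = mutant_killed(weak_test)      # the mutant survives
strong_result = mutant_killed(strong_test)  # the mutant is caught
```

Both tests produce identical coverage numbers; only the mutation check distinguishes them, which is why the two metrics are worth pairing.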
Code churn — how often a file is rewritten shortly after being written — is a quality signal too: high churn on new code often means unclear requirements or thrashing, and churn concentrated in a few files points at fragile, hard-to-get-right areas (which links straight to hotspots in the next section).
Code health & code review
These are often the most actionable metrics on the whole list, because unlike a team-wide average they point at a specific file or a specific habit you can change tomorrow.
Code health metrics describe the codebase itself:
| Metric | What it measures | Why it matters |
|---|---|---|
| Cyclomatic complexity | Number of independent paths through a function | High complexity = hard to test, easy to break |
| Coupling / cohesion | How tangled modules are / how focused each is | Low coupling + high cohesion = changes stay local |
| Churn & hotspots | Files that change often and are complex | The intersection is where bugs concentrate |
| Dependency staleness | How far behind your dependencies are | Old deps = security risk + painful upgrades later |
| Technical debt (ratio) | Estimated remediation cost vs cost to build | A trend line for "are we getting better or worse?" |
The single most useful move here is the hotspot map: overlay complexity against churn. A file that's complex but never changes is fine; a file that's complex and changes constantly is where your incidents come from. That intersection tells you exactly where refactoring pays off — a far better prioritizer than a global "technical debt score".
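A hotspot ranking can be sketched as a product of normalized complexity and normalized churn. The scoring formula, the `FileStats` record, and the file names below are illustrative assumptions; in practice churn might come from counting paths in `git log --name-only` output and complexity from whatever analyzer you already run.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    complexity: int  # e.g. summed cyclomatic complexity of the file
    commits: int     # churn: commits touching the file in the chosen window


def rank_hotspots(files: list[FileStats]) -> list[tuple[str, float]]:
    """Score each file as normalized complexity x normalized churn, highest first."""
    if not files:
        return []
    max_complexity = max(f.complexity for f in files) or 1
    max_commits = max(f.commits for f in files) or 1
    scored = [
        (f.path, (f.complexity / max_complexity) * (f.commits / max_commits))
        for f in files
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical repository: only one file is both complex and frequently changed.
files = [
    FileStats("legacy/parser.py", complexity=95, commits=1),     # complex, stable
    FileStats("utils/strings.py", complexity=5, commits=50),     # simple, busy
    FileStats("billing/invoice.py", complexity=80, commits=40),  # complex and busy
]
ranked = rank_hotspots(files)
```

Multiplying the two signals means a file has to score on both axes to rank highly, which matches the point that complex-but-stable code is fine to leave alone.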
Code review metrics describe a habit rather than a file, and small changes here move delivery a lot:
| Metric | Healthy pattern |
|---|---|
| PR size | Small — big PRs get rubber-stamped; reviewability collapses past a few hundred lines |
| Time-to-first-review | Short — a PR waiting on review is blocked WIP inflating cycle time |
| Time-to-merge | Short and predictable, once approved |
| Review coverage | % of changes that actually got a substantive review, not a drive-by 👍 |
PR size is the lever most worth pulling: shrink PRs and review latency, review quality, and escaped defects all improve together.
Agile & process health
These metrics describe how work moves through your process. The famous — and most abused — one is velocity: story points completed per sprint.
Velocity's legitimate use is capacity planning for a single, stable team: "this team completes roughly 30 points a sprint, so a 90-point epic is about three sprints." Its illegitimate uses are everything managers reach for: comparing teams (points are relative and team-specific — meaningless across teams), and setting velocity as a target. The moment velocity is a goal, teams inflate estimates and the number rises while nothing ships faster — Goodhart again. Velocity is a planning input, never a performance score.
The healthier process metrics focus on predictability and flow, not raw speed:
| Metric | What it shows | Read it for |
|---|---|---|
| Sprint predictability | Committed vs completed, sprint over sprint | Can we be trusted to deliver what we forecast? |
| Burndown chart | Remaining work over the sprint | Are we on track, or is work discovered late? |
| Cumulative flow diagram (CFD) | Count of items in each state over time | Bottlenecks — a widening band = a stage backing up |
Predictability beats velocity as a health signal: a team that reliably delivers 20 points is more valuable than one that swings between 10 and 40. And the cumulative flow diagram is the most diagnostic of the three — when the "In Progress" band keeps widening while "Done" grows slowly, you've found your bottleneck without needing anyone to report it.
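One way to put a number on predictability is the coefficient of variation of completed points, alongside a committed-versus-completed ratio per sprint. Both choices are assumptions made for illustration; teams define predictability in different ways.

```python
from statistics import mean, pstdev


def say_do_ratios(committed: list[float], completed: list[float]) -> list[float]:
    """Completed divided by committed, sprint by sprint (1.0 = delivered the forecast)."""
    return [done / planned for planned, done in zip(committed, completed)]


def variability(completed: list[float]) -> float:
    """Coefficient of variation (std dev / mean); lower means more predictable."""
    return pstdev(completed) / mean(completed)


steady = [20, 21, 19, 20]  # reliably around 20 points
swingy = [10, 40, 10, 40]  # higher average, but swings wildly

steady_cv = variability(steady)
swingy_cv = variability(swingy)
ratios = say_do_ratios(committed=[20, 20], completed=[20, 10])
```

The swingy team has the higher average velocity, yet its variability is far larger, which is the property that makes its forecasts hard to trust.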
Product & user engagement
Engineering metrics tell you the machine runs well; product metrics tell you whether anyone cares. A fast, reliable product nobody uses is still a failure, so these close the loop back to value.
| Metric | Definition | Signals |
|---|---|---|
| DAU / MAU | Daily / Monthly Active Users | Reach and habit |
| Stickiness | DAU ÷ MAU | How many monthly users show up daily — habit strength |
| Retention (cohort) | % of a signup cohort still active after N days/weeks | The truest measure of product-market fit |
| Churn | % of users/revenue lost per period | The inverse of retention; compounds brutally |
| Activation | % of new users reaching the "aha" moment | Whether onboarding actually delivers value |
| Feature adoption | % of users using a given feature | Whether what you built was worth building |
| Session duration & depth | Time per session, pages/actions per session | Engagement — context-dependent (see below) |
| Bounce / exit rate | % leaving after one page / where they leave | Friction and dead ends |
| NPS / CSAT | Net Promoter Score / Customer Satisfaction | Sentiment, self-reported |
Two things to keep honest here. First, stickiness (DAU/MAU) and retention are the ones that matter most — they measure durable value, whereas raw DAU can be pumped with notifications and vanity growth. Second, engagement metrics are direction-ambiguous: a longer session is good for a game or social app and bad for a checkout flow or support tool, where the goal is to get users done fast. Always interpret them against what the product is for — never assume "more time on site" is a win.
Cohort retention curves are the single most revealing product view: plot each signup cohort's survival over time. A curve that flattens to a horizontal asymptote means you've found a durable core of users (product-market fit); a curve that decays to zero means you're filling a leaky bucket no matter how much you spend on acquisition.
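A retention curve and a crude plateau check can be sketched like this. The cohort numbers are invented, and the plateau heuristic (last few points close together and above a floor) with its thresholds is an assumption for illustration, not a standard test for product-market fit.

```python
def retention_curve(cohort_size: int, active_by_period: list[int]) -> list[float]:
    """Fraction of the original signup cohort still active in each period."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return [active / cohort_size for active in active_by_period]


def has_plateau(curve: list[float], tail: int = 3,
                tolerance: float = 0.02, floor: float = 0.05) -> bool:
    """Heuristic: the last `tail` points sit within `tolerance` of each other, above `floor`."""
    last = curve[-tail:]
    return (len(last) == tail
            and min(last) >= floor
            and max(last) - min(last) <= tolerance)


# Invented weekly actives for two cohorts of 1,000 signups.
durable = retention_curve(1000, [1000, 520, 400, 350, 340, 335, 333])
leaky = retention_curve(1000, [1000, 400, 160, 64, 26, 10, 4])

durable_flat = has_plateau(durable)  # settles near a third of the cohort
leaky_flat = has_plateau(leaky)      # keeps decaying toward zero
```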
SEO & marketing
These sit at the boundary between engineering and growth, and engineers own more of them than they realize — because a chunk of SEO is frontend performance.
Acquisition / SEO metrics — how people find you:
| Metric | What it measures |
|---|---|
| Organic traffic | Visitors arriving from unpaid search |
| Impressions | How often you appear in search results |
| Keyword rankings | Your position for target search terms |
| Click-through rate (CTR) | Clicks ÷ impressions — how compelling your result is |
The engineering hook: Core Web Vitals are a Google ranking factor. The LCP / INP / CLS thresholds (see Measuring the Frontend) aren't only a UX concern — poor field CWV can suppress your ranking, which suppresses impressions, which suppresses organic traffic. Frontend performance work is SEO work.
Conversion / economics — whether that traffic turns into a business:
| Metric | Definition |
|---|---|
| Conversion rate | % of visitors completing the goal action (signup, purchase) |
| Funnel drop-off | Where in the multi-step flow users abandon |
| CAC | Customer Acquisition Cost — total spend ÷ new customers |
| LTV | Lifetime Value — total revenue expected from a customer |
The relationship that decides whether the whole thing is a business is the LTV:CAC ratio. A customer must be worth more than it costs to acquire them; a common rule of thumb is LTV:CAC ≥ 3:1. Below ~1:1 you lose money on every customer and growth just accelerates the losses — one of the clearest examples of a technical/product metric (retention → LTV) tying directly to a survival-level business number.
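The retention-to-LTV link can be made explicit with a small sketch. CAC follows the definition in the table above; the LTV formula (monthly revenue per customer times gross margin, divided by monthly churn) is a common simplification assumed here, and the numbers are made up.

```python
def customer_acquisition_cost(total_spend: float, new_customers: int) -> float:
    """CAC: total acquisition spend divided by customers acquired."""
    return total_spend / new_customers


def simple_ltv(monthly_revenue_per_customer: float, gross_margin: float,
               monthly_churn: float) -> float:
    """Simplified LTV: margin-adjusted monthly revenue over the expected lifetime (1 / churn)."""
    return monthly_revenue_per_customer * gross_margin / monthly_churn


cac = customer_acquisition_cost(total_spend=50_000, new_customers=100)
ltv = simple_ltv(monthly_revenue_per_customer=50, gross_margin=0.8, monthly_churn=0.02)
ratio = ltv / cac

# Halving churn doubles LTV under this model, without touching acquisition spend.
ltv_with_better_retention = simple_ltv(50, 0.8, monthly_churn=0.01)
```

Under this model, retention work shows up directly in the ratio, which is the sense in which a product metric feeds a survival-level business number.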
Security & DevSecOps
The DevSecOps shift is to treat security as a continuous metric woven through the pipeline, not a gate at the end. The "shift-left" framing means catching issues while code is being written — where they're cheap — instead of in a pre-release pentest, where they're expensive and late.
| Metric | Definition | Good direction |
|---|---|---|
| MTTD | Mean Time To Detect a vulnerability or breach | Lower |
| MTTR / mean time to patch | Detect → remediated/patched | Lower |
| Vulnerability density | Known vulns per unit of code | Lower |
| CVE / CVSS exposure | Count and severity of known vulns (CVEs), scored by CVSS | Fewer, lower severity |
| % of code & dependencies scanned | Coverage of SAST / dependency scanning | Higher — blind spots hide risk |
| Secrets-leak rate | Credentials/keys committed to the repo | Zero, caught pre-commit |
| Security debt | Backlog of known-but-unfixed findings, weighted by severity | Trending down |
A few definitions worth pinning down: a CVE (Common Vulnerabilities and Exposures) is a publicly catalogued vulnerability; CVSS (Common Vulnerability Scoring System) is the 0–10 severity score attached to it. SAST (Static Application Security Testing) scans your source; DAST (Dynamic Application Security Testing) scans the running app; SCA (Software Composition Analysis) scans your dependencies for known CVEs.
The two metrics that matter most are MTTD and mean time to patch. A vulnerability's danger is roughly proportional to how long it sits exposed — the window between "a fix exists" and "we deployed it" is your real risk surface. Time-to-patch is the security analog of DORA's time-to-restore, and elite security orgs measure it in hours for critical CVEs, not weeks.
Coverage metrics (% scanned) are the guardrail: an impressive "0 known vulnerabilities" is meaningless if only 20% of the code and none of the dependencies were scanned — you're not secure, you're just not looking.
Cost & efficiency (FinOps)
FinOps brings a financial lens to engineering: it makes cloud spend a first-class engineering metric that developers can see and influence, rather than a bill finance discovers at month-end. This is the category that ties every other metric back to a dollar figure, which is what makes it the bridge to business outcomes.
| Metric | Definition | Why it matters |
|---|---|---|
| Cloud spend trend | Total infrastructure cost over time | The top-line; catch runaway growth early |
| Unit economics | Cost per request / per user / per transaction | The one that scales — normalizes cost against usage |
| Utilization vs waste | % of provisioned resources actually used | Idle capacity and orphaned resources are pure waste |
| Cost per deploy | Infrastructure cost attributable to shipping | Ties delivery velocity to its real price |
The standout is unit economics. Absolute cloud spend should grow as you grow — that's not a problem. The question is whether cost per user (or per request) is flat, falling, or rising. Falling unit cost means you're scaling efficiently; rising unit cost means every new customer makes the economics worse, and no amount of growth fixes that. It's the FinOps equivalent of the LTV:CAC check.
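The flat, falling, or rising question can be answered with a short sketch. The function names, the first-versus-last comparison, and the 5% tolerance are illustrative assumptions; a real analysis would likely fit a trend over more data points.

```python
def unit_costs(spend: list[float], users: list[float]) -> list[float]:
    """Cost per user for each period."""
    if len(spend) != len(users) or not spend:
        raise ValueError("spend and users must be non-empty and the same length")
    return [s / u for s, u in zip(spend, users)]


def unit_cost_trend(spend: list[float], users: list[float],
                    tolerance: float = 0.05) -> str:
    """Compare the first and last period's cost per user."""
    costs = unit_costs(spend, users)
    change = (costs[-1] - costs[0]) / costs[0]
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "flat"


# Spend doubles while users quadruple: total cost is up, unit cost is down.
efficient = unit_cost_trend([10_000, 15_000, 20_000], [1_000, 2_000, 4_000])
# Spend doubles while users grow only 50%: each new user makes things worse.
worsening = unit_cost_trend([10_000, 20_000], [1_000, 1_500])
```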
This is also where a technical metric becomes a business argument: "our p99 latency dropped" is an engineering claim; "we cut cost-per-request 40% while holding p99" is a business-level result. Expressing efficiency work in unit-economic terms is how engineering reliability and performance improvements get funded.
Putting it together
You've now seen well over a hundred metrics across a dozen categories. The mistake would be to track all of them. A dashboard with 100 numbers is a dashboard nobody reads; the skill is choosing a small, balanced set that tells the truth about your specific system.
A few principles pulled from everything above:
- Pick a balanced few, not an exhaustive many. Cover a handful of dimensions (delivery, reliability, quality, developer experience, and a product/business outcome) with one or two metrics each. DORA's four keys plus an SLO plus a North Star is a better dashboard than fifty widgets.
- Never optimize a single metric. Every metric in this post is dangerous alone. The most reliable way to make a system worse is to pick one number and push it hard.
- Always pair a metric with its guardrail. This is the concrete defense against Goodhart, and it's the recurring theme of the whole tour:
  | Primary metric | Guardrail / counter-metric |
  |---|---|
  | Deployment frequency | Change failure rate |
  | Velocity / throughput | Sprint predictability, quality |
  | Code coverage | Mutation score |
  | Cloud spend reduction | Reliability (SLO), latency |
  | Feature shipping speed | Escaped-defect rate, DevEx |
- Tie everything to an outcome. For each metric, ask: if this moves, what decision do we make, and what does the customer or business get? If there's no answer, it's a vanity metric: stop tracking it.
- Measure teams and systems, not individuals. Every metric that gets attached to a person's name and a bonus stops measuring the thing and starts measuring the incentive.
The whole point of measurement is better decisions, not better dashboards. A metric earns its place only if it changes what you do — everything else is decoration. Choose few, pair each with its counter, tie each to an outcome, and re-examine the set whenever it stops telling you something you didn't already know.
