Measuring Software Engineering: Metrics That Actually Matter

22 min read
Pere Pages
Software Engineer

A tour of the metrics that describe how software teams actually perform — grouped by what they measure: delivery, reliability, productivity, frontend performance, quality, and agile process. What each one means, when it's useful, and how it goes wrong.

Why metrics (and why they mislead)

A metric is a signal, not a target. The moment you turn a signal into a goal people are rewarded for, they optimize the number — and the number stops describing reality. This is Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." Measure developers by lines of code and you get bloated code; measure by tickets closed and tickets get smaller and more numerous. The metric goes up while the thing you actually cared about stays flat or gets worse.

So the first distinction to keep straight is health metrics vs vanity metrics. A vanity metric looks good in a slide and moves the way you want without telling you anything actionable (total registered users, cumulative commits). A health metric changes your decisions: if it moves, you do something different tomorrow.

A few cross-cutting ideas show up in every category below:

Concept | What it means | Example
Leading indicator | Predicts a future outcome; you can still act on it | PR review latency, error-budget burn rate
Lagging indicator | Confirms an outcome after the fact | Quarterly churn, escaped-defect count
Guardrail / counter-metric | A paired metric that goes bad if you game the primary one | Pair deploy frequency with change failure rate
North Star metric | The single number that best proxies durable customer value | "Weekly active teams", "nights booked"

The guardrail is the concrete antidote to Goodhart: you never track a metric alone, you track it against its natural counterweight. Push deployment frequency up and change failure rate is supposed to stay flat — if it climbs, you're shipping faster by shipping worse. Push code coverage up and mutation score should climb with it — if it doesn't, you're writing tests that execute code without asserting anything. The pairing is what makes the number honest.

Keep that frame for the rest of the tour: every metric below is useful as a signal, and almost every one is dangerous as a target.

Delivery & DevOps (DORA)

The DORA metrics (from Google's DevOps Research and Assessment program, popularized by the book Accelerate) are the most battle-tested set of delivery metrics. They split into two throughput measures and two stability measures — and the research's key finding is that the best teams are elite at both, killing the old myth that speed and stability trade off against each other.

Metric | Question it answers | Type
Deployment frequency | How often do we ship to production? | Throughput
Lead time for changes | Commit → running in production, how long? | Throughput
Change failure rate (CFR) | What % of deploys cause a failure needing remediation? | Stability
Time to restore service (MTTR) | How fast do we recover from a failure? | Stability

The rough performance bands DORA reports look like this (they drift year to year, so treat them as orders of magnitude, not exact cutoffs):

Band | Deploy frequency | Lead time | Change failure rate | Time to restore
Elite | On-demand (multiple/day) | < 1 day | 0–15% | < 1 hour
High | Daily–weekly | 1 day–1 week | 16–30% | < 1 day
Medium | Weekly–monthly | 1 week–1 month | 16–30% | 1 day–1 week
Low | Monthly–biannually | 1–6 months | 16–30%+ | 1 week–1 month

The throughput pair (deploy frequency, lead time) and the stability pair (CFR, MTTR) are each other's guardrails: chasing frequency while CFR climbs means you're just shipping bugs faster.
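To make the four keys concrete, here is a minimal sketch of how they could be derived from a list of deploy records. The `Deploy` record, its field names, and `dora_summary` are hypothetical, not part of any DORA tooling, and time to restore is measured from the deploy itself, which is a simplifying assumption (real incidents may start later).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it need remediation?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Compute the four DORA metrics over a window of deploy records."""
    if not deploys:
        raise ValueError("need at least one deploy")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    # Assumption: the failure starts at deploy time.
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restores) if restores else None,
    }


# Illustrative data: four deploys over a two-day window, one of which failed.
t0 = datetime(2024, 1, 1)
deploys = [
    Deploy(t0, t0 + timedelta(hours=2)),
    Deploy(t0, t0 + timedelta(hours=4), failed=True,
           restored_at=t0 + timedelta(hours=4, minutes=30)),
    Deploy(t0, t0 + timedelta(hours=6)),
    Deploy(t0, t0 + timedelta(hours=8)),
]
summary = dora_summary(deploys, window_days=2)
print(summary)
```

Reporting all four numbers from one function is deliberate: it makes it awkward to quote the throughput pair without the stability pair next to it.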

Underneath DORA sits CI/CD pipeline health. Lead time is only as good as the pipeline that produces it, so the delivery machine has its own metrics:

Pipeline metric | Why it matters
Build duration | Long builds lengthen lead time and kill flow; the feedback loop should be minutes, not hours
Pipeline success rate | A green-main default; chronic red on main blocks everyone
Queue / wait time | Time a job waits for a runner: invisible drag that inflates lead time
Flaky-test rate | % of failures that pass on re-run; flakiness trains engineers to ignore red, which is how real failures ship
MTTR for red builds | How fast a broken main gets back to green; a broken pipeline is an outage for the whole team

Reliability: SLIs, SLOs & SLAs

Reliability metrics come in three layers that people constantly conflate:

Term | Full name | What it is | Audience
SLI | Service Level Indicator | The actual measurement, e.g. "% of requests < 300 ms" | Engineers
SLO | Service Level Objective | Your internal target for that SLI, e.g. "99.9% over 30 days" | Engineering + product
SLA | Service Level Agreement | A contractual promise to customers, with penalties if breached | Legal / customers

The relationship is nested: you measure an SLI and hold yourself to an SLO that is stricter than any SLA you sign, so you have room to react before you owe a refund.

Error budgets turn the SLO into a spending account. If your SLO is 99.9%, you're allowed 0.1% unreliability — that's the budget. Spend it slowly with background noise, or all at once on a risky launch, your call. When the budget is healthy, ship features fast; when it's exhausted, you freeze risky changes and pay down reliability. This is the mechanism that lets speed and stability coexist.
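The budget arithmetic is simple enough to sketch. The function names below are illustrative, and the 30-day window is an assumption; "burn rate" here means the observed error rate divided by the rate the SLO allows.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative means overspent)."""
    return 1 - bad_minutes / error_budget_minutes(slo, window_days)


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """1.0 means spending exactly on pace; 2.0 exhausts the budget in half the window."""
    return observed_error_rate / (1 - slo)


print(round(error_budget_minutes(0.999), 1))    # 43.2 minutes per 30 days
print(round(budget_remaining(0.999, 10.8), 2))  # 0.75 of the budget left
print(round(burn_rate(0.005, 0.999), 1))        # 5.0, i.e. five times too fast
```

A burn rate well above 1 is the leading indicator; an exhausted budget is the lagging one.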

Availability math (the "nines") is worth memorizing because the intuition is non-linear. Each added nine cuts the allowed downtime by a factor of 10:

Availability | "Nines" | Downtime / year | Downtime / month
99% | two nines | ~3.65 days | ~7.2 hours
99.9% | three nines | ~8.77 hours | ~43 minutes
99.99% | four nines | ~52.6 minutes | ~4.3 minutes
99.999% | five nines | ~5.26 minutes | ~26 seconds
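If you'd rather derive the table than memorize it, a small sketch like this reproduces it. The helper name is made up, and the year and month lengths (365.25 and 30 days) are assumptions, so results can differ slightly from the rounded figures above.

```python
SECONDS_PER_DAY = 86_400


def allowed_downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Downtime permitted by an availability target over a period."""
    return (100 - availability_pct) / 100 * period_days * SECONDS_PER_DAY


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year_hours = allowed_downtime_seconds(pct, 365.25) / 3600
    per_month_minutes = allowed_downtime_seconds(pct, 30) / 60
    print(f"{pct}%: {per_year_hours:.2f} h/year, {per_month_minutes:.2f} min/month")
```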

Latency is measured in percentiles, never averages. An average hides the tail: p50 (median) tells you the typical experience, p95/p99 tell you what your unhappiest users feel. A p99 of 2s means 1 in 100 requests takes 2s or longer, and on a page that makes 100 requests, most page loads (about 63%, if the requests are independent) hit at least one of them.
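A short sketch shows both effects: how the mean hides the tail, and how the tail compounds across a page. The sample data is invented, the percentile uses the simple nearest-rank method, and the page-level probability assumes requests are independent, which real traffic only approximates.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def chance_of_slow_request(requests_per_page: int, tail_fraction: float = 0.01) -> float:
    """Probability that at least one request on the page lands in the tail."""
    return 1 - (1 - tail_fraction) ** requests_per_page


# 98 fast requests and two very slow ones (milliseconds).
latencies_ms = [100] * 98 + [2000, 3000]
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(mean_ms)                                # 148.0, looks healthy
print(percentile(latencies_ms, 50))           # 100
print(percentile(latencies_ms, 99))           # 2000
print(round(chance_of_slow_request(100), 3))  # 0.634
```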

Underneath the SLx layer are the raw signals you actually instrument. Three well-known frameworks, each suited to a different thing:

Framework | Signals | Best for
Four Golden Signals (Google SRE) | Latency, Traffic, Errors, Saturation | Any user-facing service
RED | Rate, Errors, Duration | Request-driven microservices
USE | Utilization, Saturation, Errors | Resources: CPU, disk, memory, queues

RED is the request-centric view (what your users hit); USE is the resource-centric view (what your machines feel); the Golden Signals are essentially RED plus saturation. Most teams use RED for services and USE for infrastructure.

Finally, alerting has its own quality metrics — because an alert that no one trusts is worse than no alert:

  • MTTA (mean time to acknowledge) — how long from alert firing to a human owning it.
  • Signal-to-noise ratio — % of alerts that were actionable vs pure noise. Chronic false positives cause alert fatigue, where real pages get ignored. A good target is that nearly every page a human wakes up for was worth waking up for.

Productivity & flow

"Developer productivity" is where measurement goes to die, because the tempting metrics are the worst ones. Lines of code, commits per developer, and story points per person are traps — they measure motion, not value. LOC rewards verbosity (the best change is often a deletion); commit count rewards splitting work into confetti; per-person output punishes the collaboration, mentoring, and review that make a team fast. All three fail Goodhart instantly and all three penalize senior engineers, whose highest-value work often produces the least code.

The credible answer is the SPACE framework (GitHub / Microsoft Research), whose core message is that productivity is multi-dimensional: track metrics from several dimensions at once (the authors suggest at least three) rather than optimizing any single one:

SPACE dimension | Captures | Example metric
Satisfaction & well-being | Are devs happy and healthy? | eNPS, burnout survey
Performance | Outcome quality, not output | Change failure rate, reliability
Activity | Volume of actions (use with care) | PRs, deploys, design docs
Communication & collaboration | How work flows between people | Review latency, discoverability of docs
Efficiency & flow | Ability to work with minimal interruption | Cycle time, handoff count

For the day-to-day, flow metrics are the most actionable subset:

  • Cycle time — first commit → in production for a unit of work. The single best flow signal; short cycle time means fast feedback and small batches.
  • Work in progress (WIP) — how many things are in flight at once. High WIP means lots of context-switching and stalled, half-done work; limiting WIP is the fastest way to cut cycle time (Little's Law: cycle time = WIP ÷ throughput).
  • Throughput — items completed per unit time. Useful as a trend, dangerous as a target (it's one keystroke from "count the tickets").
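Little's Law is worth a quick worked example, because it shows why capping WIP helps even when nobody works faster. The numbers are invented, and the law describes long-run averages for a reasonably stable system, not any individual ticket.

```python
def average_cycle_time(wip: float, throughput: float) -> float:
    """Little's Law: average cycle time = WIP / throughput (same time unit as throughput)."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


# A team finishing 4 items per week:
print(average_cycle_time(wip=12, throughput=4))  # 3.0 weeks per item
print(average_cycle_time(wip=6, throughput=4))   # 1.5 weeks per item
```

Halving WIP halves the average wait at the same throughput, which is the whole argument for WIP limits.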

The honest framing: measure flow and outcomes at the team level, never rank individuals by output. As soon as a productivity metric has a name attached to it and a bonus riding on it, it stops measuring productivity.

Developer experience & team health

Where the previous section measured output and flow, this one measures the humans and the environment producing them. DevEx (Developer Experience) is the successor framing to SPACE, from some of the same authors and with a sharper lens. It argues developer effectiveness is best understood through three dimensions:

DevEx dimension | The question | Signals
Feedback loops | How fast do I learn if my work is good? | Build/test time, CI wait, review latency, deploy time
Cognitive load | How hard is it to get things done? | Onboarding time, doc quality, incidental complexity
Flow state | Can I focus without interruption? | Uninterrupted focus time, meeting load, context switches

Around those sit the "human" metrics teams routinely forget to measure — they don't show up in a delivery dashboard, but they predict whether the delivery dashboard will still look good in a year:

Metric | What it tells you | Watch for
On-call load & toil | Manual, repetitive, automatable work as % of time | > ~50% toil means no time to improve the system
Attrition / retention | Are people leaving? Regretted vs non-regretted | Regretted attrition is a leading indicator of deeper rot
Onboarding / ramp time | Time to first PR, time to productive | Long ramp = high cognitive load and poor docs
Bus factor | How many people can leave before knowledge is lost | A bus factor of 1 on a critical system is a live risk
Burnout & psychological safety | Sustainable pace; safety to speak up and fail | Measured via surveys (eNPS, safety index), not telemetry

Two cautions specific to this category. First, most of these are survey-based, and that's fine — self-reported experience is a legitimate, well-validated measurement; you don't need everything to come from a log. Second, these are the metrics most vulnerable to "we measured it once for a QBR and never acted." Their whole value is as a leading indicator: burnout and a bus factor of 1 are cheap to see and expensive to ignore.

Frontend performance

For anything users load in a browser, performance is measured mostly through Google's Core Web Vitals (LCP, INP, CLS) — plus the supporting signals (TTFB, bundle size), the lab-vs-field-data distinction, client-side reliability, and accessibility that sit alongside them. It's a big enough topic to get its own deep-dive.

→ See the dedicated post: Measuring the Frontend: Metrics That Actually Matter — Core Web Vitals, lab vs field data (Lighthouse / RUM / CrUX), TTFB and bundle size, client-side reliability, accessibility (a11y), the SEO overlap, and the frontend-adjacent product signals.

Quality & bugs

Quality metrics try to answer "how much is broken, and are we finding it before customers do?" The most important is where defects are caught:

Metric | Definition | Good direction
Defect escape rate | % of defects found in production vs pre-release | Lower: you want them caught in test, not by users
Escaped-defect density | Production defects per unit (KLOC, feature, release) | Lower and stable over releases
MTTR (bugs) | Mean time to resolve a defect once reported | Lower
Defect reopen rate | % of "fixed" bugs that come back | Lower: high reopen = superficial fixes

Escape rate is the headline: a team can have many bugs but a low escape rate (great test net) or few reported bugs but a high escape rate (users are your QA). The second is far more dangerous.

Code coverage deserves its own warning. Coverage measures which lines ran during tests — not whether anything was asserted about them. You can have 100% coverage and zero real verification. Treat it as a floor, not a target:

  • Chasing a high coverage number produces tests that call code and assert nothing.
  • Mutation testing is the honest guardrail: it deliberately introduces bugs and checks whether your tests catch them. A high mutation score means your tests actually assert behavior; pair it with coverage so coverage can't be gamed.
  • ~70–80% meaningful coverage on the code that matters beats 95% of everything.
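The gap between coverage and mutation score is easiest to see in a toy example. This is a hand-rolled illustration, not how a real mutation-testing tool works internally: the function, the single hand-written mutant, and both "tests" are invented for the sketch.

```python
def is_adult(age: int) -> bool:
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    return age > 18  # the injected bug: ">=" became ">"


def weak_test(fn) -> bool:
    fn(30)       # executes every line (100% coverage) but asserts nothing
    return True  # so it always "passes"


def strong_test(fn) -> bool:
    # Checks behavior on both sides of the boundary.
    return fn(18) is True and fn(17) is False


def kills_mutant(test) -> bool:
    """A test kills the mutant if it passes on the original and fails on the mutant."""
    return test(is_adult) and not test(is_adult_mutant)


print(kills_mutant(weak_test))    # False: the mutant survives
print(kills_mutant(strong_test))  # True: the mutant is caught
```

Both tests give the same coverage number; only the mutation check tells them apart.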

Code churn — how often a file is rewritten shortly after being written — is a quality signal too: high churn on new code often means unclear requirements or thrashing, and churn concentrated in a few files points at fragile, hard-to-get-right areas (which links straight to hotspots in the next section).

Code health & code review

These are often the most actionable metrics on the whole list, because unlike a team-wide average they point at a specific file or a specific habit you can change tomorrow.

Code health metrics describe the codebase itself:

Metric | What it measures | Why it matters
Cyclomatic complexity | Number of independent paths through a function | High complexity = hard to test, easy to break
Coupling / cohesion | How tangled modules are / how focused each is | Low coupling + high cohesion = changes stay local
Churn & hotspots | Files that change often and are complex | The intersection is where bugs concentrate
Dependency staleness | How far behind your dependencies are | Old deps = security risk + painful upgrades later
Technical debt (ratio) | Estimated remediation cost vs cost to build | A trend line for "are we getting better or worse?"

The single most useful move here is the hotspot map: overlay complexity against churn. A file that's complex but never changes is fine; a file that's complex and changes constantly is where your incidents come from. That intersection tells you exactly where refactoring pays off — a far better prioritizer than a global "technical debt score".
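A hotspot ranking can be sketched in a few lines once you have per-file complexity and change counts. How you collect those numbers is up to you; the file names and figures here are invented, and multiplying the two is just one simple scoring choice, assumed for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    complexity: int  # e.g. cyclomatic complexity summed over the file
    commits: int     # number of changes touching the file in the window


def hotspots(files: list[FileStats], top: int = 3) -> list[FileStats]:
    """Rank files by complexity x churn, highest first."""
    return sorted(files, key=lambda f: f.complexity * f.commits, reverse=True)[:top]


files = [
    FileStats("billing/invoice.py", complexity=85, commits=40),  # complex and busy
    FileStats("legacy/parser.py", complexity=120, commits=1),    # complex but stable
    FileStats("utils/strings.py", complexity=5, commits=60),     # busy but simple
]

for f in hotspots(files):
    print(f.path, f.complexity * f.commits)
```

The most complex file lands last because it almost never changes, which matches the point above: complexity alone is not the problem.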

Code review metrics describe a habit rather than a file, and small changes here move delivery a lot:

Metric | Healthy pattern
PR size | Small: big PRs get rubber-stamped; reviewability collapses past a few hundred lines
Time-to-first-review | Short: a PR waiting on review is blocked WIP inflating cycle time
Time-to-merge | Short and predictable, once approved
Review coverage | % of changes that actually got a substantive review, not a drive-by 👍

PR size is the lever most worth pulling: shrink PRs and review latency, review quality, and escaped defects all improve together.

Agile & process health

These metrics describe how work moves through your process. The famous — and most abused — one is velocity: story points completed per sprint.

Velocity's legitimate use is capacity planning for a single, stable team: "this team completes roughly 30 points a sprint, so a 90-point epic is about three sprints." Its illegitimate uses are everything managers reach for: comparing teams (points are relative and team-specific — meaningless across teams), and setting velocity as a target. The moment velocity is a goal, teams inflate estimates and the number rises while nothing ships faster — Goodhart again. Velocity is a planning input, never a performance score.

The healthier process metrics focus on predictability and flow, not raw speed:

Metric | What it shows | Read it for
Sprint predictability | Committed vs completed, sprint over sprint | Can we be trusted to deliver what we forecast?
Burndown chart | Remaining work over the sprint | Are we on track, or is work discovered late?
Cumulative flow diagram (CFD) | Count of items in each state over time | Bottlenecks: a widening band = a stage backing up

Predictability beats velocity as a health signal: a team that reliably delivers 20 points is more valuable than one that swings between 10 and 40. And the cumulative flow diagram is the most diagnostic of the three — when the "In Progress" band keeps widening while "Done" grows slowly, you've found your bottleneck without needing anyone to report it.

Product & user engagement

Engineering metrics tell you the machine runs well; product metrics tell you whether anyone cares. A fast, reliable product nobody uses is still a failure, so these close the loop back to value.

Metric | Definition | Signals
DAU / MAU | Daily / Monthly Active Users | Reach and habit
Stickiness | DAU ÷ MAU | How many monthly users show up daily (habit strength)
Retention (cohort) | % of a signup cohort still active after N days/weeks | The truest measure of product-market fit
Churn | % of users/revenue lost per period | The inverse of retention; compounds brutally
Activation | % of new users reaching the "aha" moment | Whether onboarding actually delivers value
Feature adoption | % of users using a given feature | Whether what you built was worth building
Session duration & depth | Time per session, pages/actions per session | Engagement, context-dependent (see below)
Bounce / exit rate | % leaving after one page / where they leave | Friction and dead ends
NPS / CSAT | Net Promoter Score / Customer Satisfaction | Sentiment, self-reported

Two things to keep honest here. First, stickiness (DAU/MAU) and retention are the ones that matter most — they measure durable value, whereas raw DAU can be pumped with notifications and vanity growth. Second, engagement metrics are direction-ambiguous: a longer session is good for a game or social app and bad for a checkout flow or support tool, where the goal is to get users done fast. Always interpret them against what the product is for — never assume "more time on site" is a win.

Cohort retention curves are the single most revealing product view: plot each signup cohort's survival over time. A curve that flattens to a horizontal asymptote means you've found a durable core of users (product-market fit); a curve that decays to zero means you're filling a leaky bucket no matter how much you spend on acquisition.
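Computing a retention curve needs nothing more than set intersection. In this sketch the user IDs and weekly activity sets are made up, and "active" is whatever event definition you choose.

```python
def retention_curve(cohort: set[int], active_by_week: list[set[int]]) -> list[float]:
    """Fraction of the signup cohort that is active in each week."""
    return [len(cohort & active) / len(cohort) for active in active_by_week]


def stickiness(dau: float, mau: float) -> float:
    """DAU / MAU: share of monthly users who show up on a typical day."""
    return dau / mau


cohort = set(range(1, 11))  # ten users who signed up in the same week
active_by_week = [
    set(range(1, 11)),      # week 0: everyone
    set(range(1, 7)),       # week 1: six remain
    {1, 2, 3, 4, 99},       # week 2: user 99 is not in this cohort, so ignored
    {1, 2, 3, 4},           # week 3: same four, the curve has flattened
]

print(retention_curve(cohort, active_by_week))  # [1.0, 0.6, 0.4, 0.4]
print(stickiness(dau=200, mau=1000))            # 0.2
```

The flat tail at 0.4 is the "durable core" shape; a curve that kept dropping toward zero would be the leaky bucket.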

SEO & marketing

These sit at the boundary between engineering and growth, and engineers own more of them than they realize — because a chunk of SEO is frontend performance.

Acquisition / SEO metrics — how people find you:

Metric | What it measures
Organic traffic | Visitors arriving from unpaid search
Impressions | How often you appear in search results
Keyword rankings | Your position for target search terms
Click-through rate (CTR) | Clicks ÷ impressions: how compelling your result is

The engineering hook: Core Web Vitals are a Google ranking factor. The LCP / INP / CLS thresholds (see Measuring the Frontend) aren't only a UX concern — poor field CWV can suppress your ranking, which suppresses impressions, which suppresses organic traffic. Frontend performance work is SEO work.

Conversion / economics — whether that traffic turns into a business:

Metric | Definition
Conversion rate | % of visitors completing the goal action (signup, purchase)
Funnel drop-off | Where in the multi-step flow users abandon
CAC | Customer Acquisition Cost: total spend ÷ new customers
LTV | Lifetime Value: total revenue expected from a customer

The relationship that decides whether the whole thing is a business is the LTV:CAC ratio. A customer must be worth more than it costs to acquire them; a common rule of thumb is LTV:CAC ≥ 3:1. Below ~1:1 you lose money on every customer and growth just accelerates the losses — one of the clearest examples of a technical/product metric (retention → LTV) tying directly to a survival-level business number.
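A back-of-the-envelope version of the ratio makes the retention link visible. The LTV formula below (margin-adjusted monthly revenue divided by monthly churn) is one common simplification, assumed here for illustration, and all figures are invented.

```python
def cac(total_spend: float, new_customers: int) -> float:
    """Customer Acquisition Cost: total spend / new customers."""
    return total_spend / new_customers


def ltv(monthly_revenue: float, gross_margin: float, monthly_churn: float) -> float:
    """Simplified lifetime value: margin per month x expected lifetime (1 / churn)."""
    return monthly_revenue * gross_margin / monthly_churn


acquisition_cost = cac(total_spend=50_000, new_customers=100)
healthy = ltv(monthly_revenue=50, gross_margin=0.8, monthly_churn=0.02)
leaky = ltv(monthly_revenue=50, gross_margin=0.8, monthly_churn=0.04)

print(round(healthy / acquisition_cost, 2))  # 4.0, above the 3:1 rule of thumb
print(round(leaky / acquisition_cost, 2))    # 2.0, same product with double the churn
```

Doubling churn halves LTV in this model, which is why retention work shows up directly in the ratio.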

Security & DevSecOps

The DevSecOps shift is to treat security as a continuous metric woven through the pipeline, not a gate at the end. The "shift-left" framing means catching issues while code is being written — where they're cheap — instead of in a pre-release pentest, where they're expensive and late.

Metric | Definition | Good direction
MTTD | Mean Time To Detect a vulnerability or breach | Lower
MTTR / mean time to patch | Detect → remediated/patched | Lower
Vulnerability density | Known vulns per unit of code | Lower
CVE / CVSS exposure | Count and severity of known vulns (CVEs), scored by CVSS | Fewer, lower severity
% of code & dependencies scanned | Coverage of SAST / dependency scanning | Higher: blind spots hide risk
Secrets-leak rate | Credentials/keys committed to the repo | Zero, caught pre-commit
Security debt | Backlog of known-but-unfixed findings, weighted by severity | Trending down

A few definitions worth pinning down: a CVE (Common Vulnerabilities and Exposures) is a publicly catalogued vulnerability; CVSS (Common Vulnerability Scoring System) is the 0–10 severity score attached to it. SAST (Static Application Security Testing) scans your source; DAST (Dynamic Application Security Testing) scans the running app; SCA (Software Composition Analysis) scans your dependencies for known CVEs.

The two metrics that matter most are MTTD and mean time to patch. A vulnerability's danger is roughly proportional to how long it sits exposed — the window between "a fix exists" and "we deployed it" is your real risk surface. Time-to-patch is the security analog of DORA's time-to-restore, and elite security orgs measure it in hours for critical CVEs, not weeks.

Coverage metrics (% scanned) are the guardrail: an impressive "0 known vulnerabilities" is meaningless if only 20% of the code and none of the dependencies were scanned — you're not secure, you're just not looking.

Cost & efficiency (FinOps)

FinOps brings a financial lens to engineering: it makes cloud spend a first-class engineering metric that developers can see and influence, rather than a bill finance discovers at month-end. This is the category that ties every other metric back to a dollar figure, which is what makes it the bridge to business outcomes.

Metric | Definition | Why it matters
Cloud spend trend | Total infrastructure cost over time | The top-line; catch runaway growth early
Unit economics | Cost per request / per user / per transaction | The one that scales: normalizes cost against usage
Utilization vs waste | % of provisioned resources actually used | Idle capacity and orphaned resources are pure waste
Cost per deploy | Infrastructure cost attributable to shipping | Ties delivery velocity to its real price

The standout is unit economics. Absolute cloud spend should grow as you grow — that's not a problem. The question is whether cost per user (or per request) is flat, falling, or rising. Falling unit cost means you're scaling efficiently; rising unit cost means every new customer makes the economics worse, and no amount of growth fixes that. It's the FinOps equivalent of the LTV:CAC check.
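The check itself is a division and a comparison. In this sketch the monthly figures are invented, and the 2% tolerance for calling the trend "flat" is an arbitrary assumption.

```python
def unit_costs(months: list[tuple[str, float, int]]) -> list[tuple[str, float]]:
    """months: (label, cloud_spend, active_users) -> (label, cost per user)."""
    return [(label, spend / users) for label, spend, users in months]


def trend(costs: list[tuple[str, float]], tolerance: float = 0.02) -> str:
    """Compare the first and last unit cost; small changes count as flat."""
    first, last = costs[0][1], costs[-1][1]
    change = (last - first) / first
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "flat"


# Spend more than doubles, but users triple.
months = [("Jan", 10_000, 1_000), ("Feb", 18_000, 2_000), ("Mar", 24_000, 3_000)]
costs = unit_costs(months)

print(costs)         # [('Jan', 10.0), ('Feb', 9.0), ('Mar', 8.0)]
print(trend(costs))  # falling
```

The top-line bill grew 2.4x here and the verdict is still "falling", which is exactly why the absolute number is the wrong one to alarm on.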

This is also where a technical metric becomes a business argument: "our p99 latency dropped" is an engineering claim; "we cut cost-per-request 40% while holding p99" is a business-level result. Expressing efficiency work in unit-economic terms is how engineering reliability and performance improvements get funded.

Putting it together

You've now seen well over a hundred metrics across a dozen categories. The mistake would be to track all of them. A dashboard with 100 numbers is a dashboard nobody reads; the skill is choosing a small, balanced set that tells the truth about your specific system.

A few principles pulled from everything above:

  1. Pick a balanced few, not an exhaustive many. Cover a handful of dimensions — delivery, reliability, quality, developer experience, and a product/business outcome — with one or two metrics each. DORA's four keys plus an SLO plus a North Star is a better dashboard than fifty widgets.

  2. Never optimize a single metric. Every metric in this post is dangerous alone. The most reliable way to make a system worse is to pick one number and push it hard.

  3. Always pair a metric with its guardrail. This is the concrete defense against Goodhart, and it's the recurring theme of the whole tour:

    Primary metric | Guardrail / counter-metric
    Deployment frequency | Change failure rate
    Velocity / throughput | Sprint predictability, quality
    Code coverage | Mutation score
    Cloud spend reduction | Reliability (SLO), latency
    Feature shipping speed | Escaped-defect rate, DevEx

  4. Tie everything to an outcome. For each metric, ask: if this moves, what decision do we make, and what does the customer or business get? If there's no answer, it's a vanity metric — stop tracking it.

  5. Measure teams and systems, not individuals. Every metric that gets attached to a person's name and a bonus stops measuring the thing and starts measuring the incentive.

The whole point of measurement is better decisions, not better dashboards. A metric earns its place only if it changes what you do — everything else is decoration. Choose few, pair each with its counter, tie each to an outcome, and re-examine the set whenever it stops telling you something you didn't already know.