Software Engineering Metrics That Matter: DORA, Delivery, and Reliability

June 15, 2026 · 25 min read

Software Engineer


A tour of the metrics that describe how software teams actually perform — grouped by what they measure: delivery, reliability, productivity, frontend performance, quality, and agile process. What each one means, when it's useful, and how it goes wrong.

Where a category has a concrete metric set, the table carries the tools that produce it, tagged by kind: `oss`, open-source you host (Prometheus, Grafana, OpenTelemetry) · `saas`, paid observability SaaS (Datadog, New Relic, Honeycomb) · `platform`, a tool you already use, read natively (GitHub, GitLab, Jira) · `specialist`, a dedicated point tool (LinearB, SonarQube, Snyk, PagerDuty, Kubecost). The frontend, product, and SEO/marketing sections defer their tooling to the dedicated deep-dives linked from each.

Why metrics (and why they mislead)

A metric is a signal, not a target. The moment you turn a signal into a goal people are rewarded for, they optimize the number — and the number stops describing reality. This is Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." Measure developers by lines of code and you get bloated code; measure by tickets closed and tickets get smaller and more numerous. The metric goes up while the thing you actually cared about stays flat or gets worse.

So the first distinction to keep straight is health metrics vs vanity metrics. A vanity metric looks good in a slide and moves the way you want without telling you anything actionable (total registered users, cumulative commits). A health metric changes your decisions: if it moves, you do something different tomorrow.

A few cross-cutting ideas show up in every category below:

Concept	What it means	Example
Leading indicator	Predicts a future outcome; you can still act on it	PR review latency, error-budget burn rate
Lagging indicator	Confirms an outcome after the fact	Quarterly churn, escaped-defect count
Guardrail / counter-metric	A paired metric that goes bad if you game the primary one	Pair deploy frequency with change failure rate
North Star metric	The single number that best proxies durable customer value	"Weekly active teams", "nights booked"

The guardrail is the concrete antidote to Goodhart: you never track a metric alone, you track it against its natural counterweight. Push deployment frequency up and change failure rate is supposed to stay flat — if it climbs, you're shipping faster by shipping worse. Push code coverage up and mutation score should climb with it — if it doesn't, you're writing tests that execute code without asserting anything. The pairing is what makes the number honest.

Keep that frame for the rest of the tour: every metric below is useful as a signal, and almost every one is dangerous as a target.

Delivery & DevOps (DORA)

The DORA metrics (from Google's DevOps Research and Assessment program, popularized by the book Accelerate) are the most battle-tested set of delivery metrics. They split into two throughput measures and two stability measures — and the research's key finding is that the best teams are elite at both, killing the old myth that speed and stability trade off against each other.

Metric	Question it answers	Type	Tools
Deployment frequency	How often do we ship to production?	Throughput	GitHub, GitLab deployments (`platform`); LinearB, Sleuth, DX (`specialist`)
Lead time for changes	Commit → running in production, how long?	Throughput	LinearB, Jellyfish, Sleuth (`specialist`); GitHub/GitLab (`platform`)
Change failure rate (CFR)	What % of deploys cause a failure needing remediation?	Stability	Deploy tracker + incident tool (Sleuth, LinearB `specialist`)
Time to restore service (MTTR)	How fast do we recover from a failure?	Stability	PagerDuty, Opsgenie, incident.io (`specialist`)

The rough performance bands DORA reports look like this (they drift year to year, so treat them as orders of magnitude, not exact cutoffs):

Band	Deploy frequency	Lead time	Change failure rate	Time to restore
Elite	On-demand (multiple/day)	< 1 day	0–15%	< 1 hour
High	Daily–weekly	1 day–1 week	16–30%	< 1 day
Medium	Weekly–monthly	1 week–1 month	16–30%	1 day–1 week
Low	Monthly–biannually	1–6 months	16–30%+	1 week–1 month


The throughput pair (deploy frequency, lead time) and the stability pair (CFR, MTTR) are each other's guardrails: chasing frequency while CFR climbs means you're just shipping bugs faster.
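As a rough illustration of how the four keys fall out of data most teams already have, the sketch below computes them from a list of deploy records. The `Deploy` record shape and the `dora_summary` function are hypothetical names, not part of any DORA tooling, and using medians rather than means is an assumption. It also assumes the list is non-empty.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    # Hypothetical record shape; real data would come from your deploy tracker.
    commit_at: datetime                     # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it need remediation?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Compute the four DORA keys over a window of deploy records (non-empty)."""
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        # Throughput pair
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.commit_at for d in deploys),
        # Stability pair
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Reading the two pairs side by side from one function is the point: a rising `deploys_per_day` is only good news if `change_failure_rate` in the same summary has not risen with it.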

Underneath DORA sits CI/CD pipeline health. Lead time is only as good as the pipeline that produces it, so the delivery machine has its own metrics:

Pipeline metric	Why it matters	Tools
Build duration	Long builds lengthen lead time and kill flow; the feedback loop should be minutes, not hours	CI platform analytics: GitHub Actions, GitLab CI, CircleCI, Jenkins (`platform`)
Pipeline success rate	A green-main default; chronic red on `main` blocks everyone	CI platform insights (`platform`)
Queue / wait time	Time a job waits for a runner — invisible drag that inflates lead time	CI platform runner metrics (`platform`)
Flaky-test rate	% of failures that pass on re-run; flakiness trains engineers to ignore red, which is how real failures ship	Trunk, BuildPulse, CircleCI test insights (`specialist`)
MTTR for red builds	How fast a broken `main` gets back to green — a broken pipeline is an outage for the whole team	CI platform + alerting (`platform`/`specialist`)

Reliability: SLIs, SLOs & SLAs

Reliability metrics come in three layers that people constantly conflate:

Term	Full name	What it is	Audience
SLI	Service Level Indicator	The actual measurement, e.g. "% of requests < 300 ms"	Engineers
SLO	Service Level Objective	Your internal target for that SLI, e.g. "99.9% over 30 days"	Engineering + product
SLA	Service Level Agreement	A contractual promise to customers, with penalties if breached	Legal / customers

The relationship is nested: you measure an SLI, set an internal SLO on it, and keep that SLO stricter than any SLA you sign, so you have room to react before you owe a refund.

Error budgets turn the SLO into a spending account. If your SLO is 99.9%, you're allowed 0.1% unreliability — that's the budget. Spend it slowly with background noise, or all at once on a risky launch, your call. When the budget is healthy, ship features fast; when it's exhausted, you freeze risky changes and pay down reliability. This is the mechanism that lets speed and stability coexist.
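The spending-account idea can be made concrete with a few lines of arithmetic. This is a minimal sketch for a request-based SLO; the function name `error_budget` and the returned keys are illustrative assumptions, not a standard API.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Return the failure allowance and how much of it is left.

    slo is a fraction strictly between 0 and 1, e.g. 0.999 for "three nines".
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed = (1 - slo) * total_requests          # failures the SLO tolerates
    remaining = 1 - failed_requests / allowed     # 1.0 = untouched, <= 0 = exhausted
    return {
        "allowed_failures": allowed,
        "spent": failed_requests,
        "budget_remaining": remaining,
    }


# Example: 99.9% SLO over 1,000,000 requests with 250 failures so far.
budget = error_budget(0.999, 1_000_000, 250)
```

In this example the SLO tolerates about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget. A team following the policy above would keep shipping; once `budget_remaining` reaches zero or below, risky changes pause.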

Availability math (the "nines") is worth memorizing because the intuition is non-linear. Each additional nine cuts the allowed downtime by 10× and is usually far harder to earn than the one before:

Availability	"Nines"	Downtime / year	Downtime / month
99%	two nines	~3.65 days	~7.2 hours
99.9%	three nines	~8.77 hours	~43 minutes
99.99%	four nines	~52.6 minutes	~4.3 minutes
99.999%	five nines	~5.26 minutes	~26 seconds
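The table is one multiplication per cell. The sketch below reproduces it, assuming a 365.25-day year and a 30-day month (assumptions chosen to match the rounded figures above); `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta

YEAR = timedelta(days=365.25)   # assumption: average year length
MONTH = timedelta(days=30)      # assumption: 30-day month


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Downtime permitted in `period` at the given availability percentage."""
    return period * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    print(pct, allowed_downtime(pct, YEAR), allowed_downtime(pct, MONTH))
```

The same helper works for any SLO window, for example a rolling 28-day period, by passing a different `timedelta`.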

Latency is measured in percentiles, never averages. An average hides the tail: p50 (median) tells you the typical experience, while p95/p99 tell you what your unhappiest users feel. A p99 of 2s means 1 in 100 requests takes 2s or longer, and on a page that makes 100 such requests, roughly two in three page loads (1 − 0.99^100 ≈ 63%, assuming independent requests) hit at least one of them.
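A small sketch makes both points checkable: how an average hides a slow tail, and how tail exposure compounds across many requests. The function names are hypothetical, the sample data is made up for illustration, and the compounding formula assumes requests are independent, which real page loads only approximate.

```python
from statistics import mean, quantiles


def latency_percentiles(samples_ms: list[float]) -> dict:
    """Return p50, p95, and p99 from raw latency samples."""
    cuts = quantiles(samples_ms, n=100)   # 99 cut points; cuts[k - 1] is the k-th percentile
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def p_hit_tail(percentile: float, n_requests: int) -> float:
    """Chance that at least one of n independent requests lands beyond the percentile."""
    return 1 - (percentile / 100) ** n_requests


# Illustrative data: 95 fast requests and 5 slow ones.
samples = [100.0] * 95 + [3000.0] * 5
average = mean(samples)                  # pulled up by the tail, describes nobody
stats = latency_percentiles(samples)     # median stays at the typical experience
exposure = p_hit_tail(99, 100)           # share of 100-request pages touching the p99 tail
```

Here the mean lands between the fast and slow groups and matches no real request, while the median and p99 each describe an experience some user actually had.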

Underneath the SLx layer are the raw signals you actually instrument. Three well-known frameworks, each suited to a different thing:

Framework	Signals	Best for
Four Golden Signals (Google SRE)	Latency, Traffic, Errors, Saturation	Any user-facing service
RED	Rate, Errors, Duration	Request-driven microservices
USE	Utilization, Saturation, Errors	Resources: CPU, disk, memory, queues

RED is the request-centric view (what your users hit); USE is the resource-centric view (what your machines feel); the Golden Signals are essentially RED plus saturation. Most teams use RED for services and USE for infrastructure.

Measure with: an observability stack — Prometheus + Grafana or OpenTelemetry (oss), or Datadog, New Relic, Honeycomb (saas) — collects these signals and backs the SLIs above.

Finally, alerting has its own quality metrics — because an alert that no one trusts is worse than no alert:

MTTA (mean time to acknowledge) — how long from alert firing to a human owning it.
Signal-to-noise ratio — % of alerts that were actionable vs pure noise. Chronic false positives cause alert fatigue, where real pages get ignored. A good target is that nearly every page a human wakes up for was worth waking up for.

Measure with: the on-call platform — PagerDuty, Opsgenie, incident.io (specialist) — which reports MTTA and alert volume natively.

Productivity & flow

"Developer productivity" is where measurement goes to die, because the tempting metrics are the worst ones. Lines of code, commits per developer, and story points per person are traps — they measure motion, not value. LOC rewards verbosity (the best change is often a deletion); commit count rewards splitting work into confetti; per-person output punishes the collaboration, mentoring, and review that make a team fast. All three fail Goodhart instantly and all three penalize senior engineers, whose highest-value work often produces the least code.

The credible answer is the SPACE framework (GitHub / Microsoft Research), whose core message is that productivity is multi-dimensional, so you should track a small set of metrics spread across several dimensions (the authors suggest at least three) rather than optimizing any single one:

SPACE dimension	Captures	Example metric
Satisfaction & well-being	Are devs happy and healthy?	eNPS, burnout survey
Performance	Outcome quality, not output	Change failure rate, reliability
Activity	Volume of actions (use with care)	PRs, deploys, design docs
Communication & collaboration	How work flows between people	Review latency, discoverability of docs
Efficiency & flow	Ability to work with minimal interruption	Cycle time, handoff count

For the day-to-day, flow metrics are the most actionable subset:

Cycle time — first commit → in production for a unit of work. The single best flow signal; short cycle time means fast feedback and small batches.
Work in progress (WIP) — how many things are in flight at once. High WIP means lots of context-switching and stalled, half-done work; limiting WIP is the fastest way to cut cycle time (Little's Law: cycle time = WIP ÷ throughput).
Throughput — items completed per unit time. Useful as a trend, dangerous as a target (it's one keystroke from "count the tickets").
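Little's Law from the WIP bullet is easy to sanity-check with numbers. This is a minimal sketch with a hypothetical function name; the law describes long-run averages for a reasonably stable system, so treat the output as an estimate rather than a forecast for any single item.

```python
def average_cycle_time(wip: float, throughput_per_week: float) -> float:
    """Little's Law: average cycle time (weeks) = WIP / throughput (items per week)."""
    if throughput_per_week <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput_per_week


# A team finishing 4 items a week with 12 items in flight...
before = average_cycle_time(wip=12, throughput_per_week=4)
# ...and the same team after capping WIP at 6, throughput unchanged.
after = average_cycle_time(wip=6, throughput_per_week=4)
```

With throughput held constant, halving WIP halves the average cycle time (3 weeks down to 1.5 in this example), which is why a WIP limit is usually the cheapest lever to try first.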

The honest framing: measure flow and outcomes at the team level, never rank individuals by output. As soon as a productivity metric has a name attached to it and a bonus riding on it, it stops measuring productivity. (Measuring people without corrupting their behavior is a whole topic of its own — see Measuring the Immeasurable.)

Developer experience & team health

Where the previous section measured output and flow, this one measures the humans and the environment producing them. DevEx (Developer Experience) is the successor framing to SPACE, with overlapping authors and a sharper lens. It argues developer effectiveness is best understood through three dimensions:

DevEx dimension	The question	Signals
Feedback loops	How fast do I learn if my work is good?	Build/test time, CI wait, review latency, deploy time
Cognitive load	How hard is it to get things done?	Onboarding time, doc quality, incidental complexity
Flow state	Can I focus without interruption?	Uninterrupted focus time, meeting load, context switches

Around those sit the "human" metrics teams routinely forget to measure — they don't show up in a delivery dashboard, but they predict whether the delivery dashboard will still look good in a year:

Metric	What it tells you	Watch for
On-call load & toil	Manual, repetitive, automatable work as % of time	> ~50% toil means no time to improve the system
Attrition / retention	Are people leaving? Regretted vs non-regretted	Regretted attrition is a leading indicator of deeper rot
Onboarding / ramp time	Time to first PR, time to productive	Long ramp = high cognitive load and poor docs
Bus factor	How many people can leave before knowledge is lost	A bus factor of 1 on a critical system is a live risk
Burnout & psychological safety	Sustainable pace; safety to speak up and fail	Measured via surveys (eNPS, safety index), not telemetry

Two cautions specific to this category. First, most of these are survey-based, and that's fine — self-reported experience is a legitimate, well-validated measurement; you don't need everything to come from a log. Second, these are the metrics most vulnerable to "we measured it once for a QBR and never acted." Their whole value is as a leading indicator: burnout and a bus factor of 1 are cheap to see and expensive to ignore.

Frontend performance

For anything users load in a browser, performance is measured mostly through Google's Core Web Vitals (LCP, INP, CLS) — plus the supporting signals (TTFB, bundle size), the lab-vs-field-data distinction, client-side reliability, and accessibility that sit alongside them. It's a big enough topic to get its own deep-dive.

→ See the dedicated post: Measuring the Frontend: Metrics That Actually Matter — Core Web Vitals, lab vs field data (Lighthouse / RUM / CrUX), TTFB and bundle size, client-side reliability, and accessibility (a11y). The product and SEO/marketing signals it borders on get their own deep-dives, linked from the Product and SEO & marketing sections below.

Quality & bugs

Quality metrics try to answer "how much is broken, and are we finding it before customers do?" The most important is where defects are caught:

Metric	Definition	Good direction	Tools
Defect escape rate	% of defects found in production vs pre-release	Lower — you want them caught in test, not by users	Issue tracker: Jira, Linear (`platform`); Sentry for production defects (`specialist`)
Escaped-defect density	Production defects per unit (KLOC, feature, release)	Lower and stable over releases	Jira / Linear reports (`platform`)
MTTR (bugs)	Mean time to resolve a defect once reported	Lower	Jira, Linear (`platform`); Sentry (`specialist`)
Defect reopen rate	% of "fixed" bugs that come back	Lower — high reopen = superficial fixes	Issue-tracker workflow reports (`platform`)

Escape rate is the headline: a team can have many bugs but a low escape rate (great test net) or few reported bugs but a high escape rate (users are your QA). The second is far more dangerous.

Code coverage deserves its own warning. Coverage measures which lines ran during tests — not whether anything was asserted about them. You can have 100% coverage and zero real verification. Treat it as a floor, not a target:

Chasing a high coverage number produces tests that call code and assert nothing.
Mutation testing is the honest guardrail: it deliberately introduces bugs and checks whether your tests catch them. A high mutation score means your tests actually assert behavior; pair it with coverage so coverage can't be gamed.
~70–80% meaningful coverage on the code that matters beats 95% of everything.
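The gap between coverage and verification shows up in a toy case. The sketch below uses a hand-written mutant instead of a real mutation-testing tool, which would generate mutants automatically; every name in it (`apply_discount`, `weak_test`, `strong_test`, `kills`) is hypothetical.

```python
def apply_discount(price: float, percent: float) -> float:
    """Return the price after a percentage discount."""
    return price * (1 - percent / 100)


def mutant_discount(price: float, percent: float) -> float:
    """Hand-made mutant: the '-' has been flipped to '+'."""
    return price * (1 + percent / 100)


def weak_test(impl) -> None:
    # Executes every line of the implementation (100% line coverage),
    # but asserts nothing about the result.
    impl(200.0, 10.0)


def strong_test(impl) -> None:
    # Pins the expected behavior, so a wrong result fails the test.
    assert abs(impl(200.0, 10.0) - 180.0) < 1e-9


def kills(test, mutant) -> bool:
    """True if the test fails when run against the mutant."""
    try:
        test(mutant)
    except AssertionError:
        return True
    return False


weak_kills = kills(weak_test, mutant_discount)      # False: the mutant survives
strong_kills = kills(strong_test, mutant_discount)  # True: the mutant is caught
```

Both tests give the same coverage number for `apply_discount`; only the mutation check distinguishes them, which is why the two metrics are worth reading together.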

Code churn — how often a file is rewritten shortly after being written — is a quality signal too: high churn on new code often means unclear requirements or thrashing, and churn concentrated in a few files points at fragile, hard-to-get-right areas (which links straight to hotspots in the next section).

Code health & code review

These are often the most actionable metrics on the whole list, because unlike a team-wide average they point at a specific file or a specific habit you can change tomorrow.

Code health metrics describe the codebase itself:

Metric	What it measures	Why it matters	Tools
Cyclomatic complexity	Number of independent paths through a function	High complexity = hard to test, easy to break	SonarQube, Code Climate (`specialist`)
Coupling / cohesion	How tangled modules are / how focused each is	Low coupling + high cohesion = changes stay local	SonarQube, CodeScene (`specialist`)
Churn & hotspots	Files that change often and are complex	The intersection is where bugs concentrate	CodeScene (`specialist`); `git` history (`oss`)
Dependency staleness	How far behind your dependencies are	Old deps = security risk + painful upgrades later	Dependabot, Renovate (`platform`)
Technical debt (ratio)	Estimated remediation cost vs cost to build	A trend line for "are we getting better or worse?"	SonarQube, Code Climate (`specialist`)

The single most useful move here is the hotspot map: overlay complexity against churn. A file that's complex but never changes is fine; a file that's complex and changes constantly is where your incidents come from. That intersection tells you exactly where refactoring pays off — a far better prioritizer than a global "technical debt score".
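A hotspot ranking can be approximated in a few lines once you have two numbers per file. In this sketch the churn count is assumed to come from version-control history (for example, counting how often each path appears in `git log --name-only` output over a window) and the complexity figure from whatever analyzer you already run. The file paths, the `rank_hotspots` name, and the choice of churn × complexity as the score are all illustrative assumptions.

```python
def rank_hotspots(files: dict[str, tuple[int, int]], top: int = 3) -> list[str]:
    """Rank files by churn * complexity, highest first.

    files maps path -> (commits_in_window, complexity_score).
    """
    scored = sorted(
        files.items(),
        key=lambda item: item[1][0] * item[1][1],
        reverse=True,
    )
    return [path for path, _ in scored[:top]]


# Hypothetical inputs: (commits in the last 90 days, cyclomatic complexity).
files = {
    "billing/invoice.py": (42, 35),   # complex AND changing constantly
    "legacy/parser.py": (1, 80),      # complex but stable
    "utils/strings.py": (50, 3),      # busy but simple
}
hotspots = rank_hotspots(files)
```

The complex-but-stable parser ranks last despite having the highest complexity, which matches the argument above: refactoring effort goes where complexity and change overlap.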

Code review metrics describe a habit rather than a file, and small changes here move delivery a lot:

Metric	Healthy pattern	Tools
PR size	Small — big PRs get rubber-stamped; reviewability collapses past a few hundred lines	GitHub, GitLab (`platform`); LinearB, Graphite (`specialist`)
Time-to-first-review	Short — a PR waiting on review is blocked WIP inflating cycle time	LinearB, Graphite, Pull Panda (`specialist`)
Time-to-merge	Short and predictable, once approved	GitHub/GitLab insights (`platform`); LinearB (`specialist`)
Review coverage	% of changes that actually got a substantive review, not a drive-by 👍	GitHub/GitLab (`platform`)

PR size is the lever most worth pulling: shrink PRs and review latency, review quality, and escaped defects all improve together.

Agile & process health

These metrics describe how work moves through your process. The famous — and most abused — one is velocity: story points completed per sprint.

Velocity's legitimate use is capacity planning for a single, stable team: "this team completes roughly 30 points a sprint, so a 90-point epic is about three sprints." Its illegitimate uses are everything managers reach for: comparing teams (points are relative and team-specific — meaningless across teams), and setting velocity as a target. The moment velocity is a goal, teams inflate estimates and the number rises while nothing ships faster — Goodhart again. Velocity is a planning input, never a performance score.

The healthier process metrics focus on predictability and flow, not raw speed:

Metric	What it shows	Read it for	Tools
Sprint predictability	Committed vs completed, sprint over sprint	Can we be trusted to deliver what we forecast?	Jira, Linear, Azure Boards (`platform`)
Burndown chart	Remaining work over the sprint	Are we on track, or is work discovered late?	Jira, Linear (`platform`)
Cumulative flow diagram (CFD)	Count of items in each state over time	Bottlenecks — a widening band = a stage backing up	Jira, Azure Boards (`platform`); Actionable Agile (`specialist`)

Predictability beats velocity as a health signal: a team that reliably delivers 20 points is more valuable than one that swings between 10 and 40. And the cumulative flow diagram is the most diagnostic of the three — when the "In Progress" band keeps widening while "Done" grows slowly, you've found your bottleneck without needing anyone to report it.

Product & user engagement

Engineering metrics tell you the machine runs well; product metrics tell you whether anyone cares. A fast, reliable product nobody uses is still a failure, so these close the loop back to value.

Metric	Definition	Signals
DAU / MAU	Daily / Monthly Active Users	Reach and habit
Stickiness	DAU ÷ MAU	How many monthly users show up daily — habit strength
Retention (cohort)	% of a signup cohort still active after N days/weeks	The truest measure of product-market fit
Churn	% of users/revenue lost per period	The complement of retention (churn = 1 − retention); compounds brutally
Activation	% of new users reaching the "aha" moment	Whether onboarding actually delivers value
Feature adoption	% of users using a given feature	Whether what you built was worth building
Session duration & depth	Time per session, pages/actions per session	Engagement — context-dependent (see below)
Bounce / exit rate	% leaving after one page / where they leave	Friction and dead ends
NPS / CSAT	Net Promoter Score / Customer Satisfaction	Sentiment, self-reported

Two things to keep honest here. First, stickiness (DAU/MAU) and retention are the ones that matter most — they measure durable value, whereas raw DAU can be pumped with notifications and vanity growth. Second, engagement metrics are direction-ambiguous: a longer session is good for a game or social app and bad for a checkout flow or support tool, where the goal is to get users done fast. Always interpret them against what the product is for — never assume "more time on site" is a win.

Cohort retention curves are the single most revealing product view: plot each signup cohort's survival over time. A curve that flattens to a horizontal asymptote means you've found a durable core of users (product-market fit); a curve that decays to zero means you're filling a leaky bucket no matter how much you spend on acquisition.
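A retention curve is just a per-cohort ratio over time, and the "does it flatten?" question can be approximated with a simple check on the tail. The sketch below uses made-up cohort numbers; `has_flattened`, its three-point window, and its 2-percentage-point tolerance are illustrative assumptions, not an established test for product-market fit.

```python
def retention_curve(cohort_size: int, active_by_week: list[int]) -> list[float]:
    """Fraction of the signup cohort still active in each week."""
    return [active / cohort_size for active in active_by_week]


def has_flattened(curve: list[float], tail: int = 3, tolerance: float = 0.02) -> bool:
    """Heuristic: the last `tail` points stay above zero and vary by less than `tolerance`."""
    last = curve[-tail:]
    return len(last) == tail and min(last) > 0 and max(last) - min(last) < tolerance


# Hypothetical cohorts of 1,000 signups each, tracked for six weeks.
durable = retention_curve(1000, [1000, 600, 450, 400, 395, 390])
leaky = retention_curve(1000, [1000, 500, 250, 120, 60, 30])

durable_flat = has_flattened(durable)   # True: settles near 39-40%
leaky_flat = has_flattened(leaky)       # False: still halving every week
```

In practice you would plot every cohort and compare curves visually; a check like this is mainly useful as an automated flag when a new cohort's tail stops resembling earlier ones.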

→ See the dedicated post: Measuring Product: Metrics That Actually Matter — conversion & funnel, engagement, activation, feature adoption, stickiness, retention curves, and the North Star.

SEO & marketing

These sit at the boundary between engineering and growth, and engineers own more of them than they realize — because a chunk of SEO is frontend performance.

Acquisition / SEO metrics — how people find you:

Metric	What it measures
Organic traffic	Visitors arriving from unpaid search
Impressions	How often you appear in search results
Keyword rankings	Your position for target search terms
Click-through rate (CTR)	Clicks ÷ impressions — how compelling your result is

The engineering hook: Core Web Vitals are a Google ranking factor. The LCP / INP / CLS thresholds (see Measuring the Frontend) aren't only a UX concern — poor field CWV can suppress your ranking, which suppresses impressions, which suppresses organic traffic. Frontend performance work is SEO work.

Conversion / economics — whether that traffic turns into a business:

Metric	Definition
Conversion rate	% of visitors completing the goal action (signup, purchase)
Funnel drop-off	Where in the multi-step flow users abandon
CAC	Customer Acquisition Cost — total spend ÷ new customers
LTV	Lifetime Value — total revenue expected from a customer

The relationship that decides whether the whole thing is a business is the LTV:CAC ratio. A customer must be worth more than it costs to acquire them; a common rule of thumb is LTV:CAC ≥ 3:1. Below ~1:1 you lose money on every customer and growth just accelerates the losses — one of the clearest examples of a technical/product metric (retention → LTV) tying directly to a survival-level business number.
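The ratio is simple enough to compute by hand, and doing so shows how retention feeds it. The sketch below uses one common simplified LTV approximation (monthly revenue per customer × gross margin ÷ monthly churn); that formula, the function names, and the example figures are assumptions for illustration, and real LTV models are usually more involved.

```python
def cac(total_spend: float, new_customers: int) -> float:
    """Customer Acquisition Cost: total acquisition spend / new customers."""
    return total_spend / new_customers


def simple_ltv(monthly_revenue: float, gross_margin: float, monthly_churn: float) -> float:
    """Simplified LTV: margin-adjusted monthly revenue over expected lifetime (1 / churn)."""
    if monthly_churn <= 0:
        raise ValueError("monthly_churn must be positive")
    return monthly_revenue * gross_margin / monthly_churn


def ltv_to_cac(ltv: float, acquisition_cost: float) -> float:
    return ltv / acquisition_cost


# Hypothetical numbers: $50,000 spend for 100 customers, $50/month, 80% margin.
acquisition_cost = cac(50_000, 100)
ratio_at_2pct_churn = ltv_to_cac(simple_ltv(50, 0.8, 0.02), acquisition_cost)
ratio_at_5pct_churn = ltv_to_cac(simple_ltv(50, 0.8, 0.05), acquisition_cost)
```

With these inputs the ratio is about 4:1 at 2% monthly churn and drops to about 1.6:1 at 5%, with no change in spend or pricing. That sensitivity is the link between retention and the business number.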

→ See the dedicated post: Measuring Marketing & SEO: Metrics That Actually Matter — the frontend-owned SEO slice, organic traffic / impressions / rankings / CTR, channel mix, CAC, LTV, the LTV:CAC ratio, attribution, and ROAS.

Security & DevSecOps

The DevSecOps shift is to treat security as a continuous metric woven through the pipeline, not a gate at the end. The "shift-left" framing means catching issues while code is being written — where they're cheap — instead of in a pre-release pentest, where they're expensive and late.

Metric	Definition	Good direction	Tools
MTTD	Mean Time To Detect a vulnerability or breach	Lower	Snyk, GitHub Advanced Security (`specialist`/`platform`)
MTTR / mean time to patch	Detect → remediated/patched	Lower	Dependabot, Renovate (`platform`); Snyk (`specialist`)
Vulnerability density	Known vulns per unit of code	Lower	SonarQube, Snyk (`specialist`)
CVE / CVSS exposure	Count and severity of known vulns (CVEs), scored by CVSS	Fewer, lower severity	Snyk, Trivy (`oss`/`specialist`)
% of code & dependencies scanned	Coverage of SAST / dependency scanning	Higher — blind spots hide risk	GitHub Advanced Security (`platform`); Snyk, Semgrep (`specialist`)
Secrets-leak rate	Credentials/keys committed to the repo	Zero, caught pre-commit	GitGuardian, TruffleHog, GitHub secret scanning (`specialist`/`platform`)
Security debt	Backlog of known-but-unfixed findings, weighted by severity	Trending down	Snyk, SonarQube dashboards (`specialist`)

A few definitions worth pinning down: a CVE (Common Vulnerabilities and Exposures) is a publicly catalogued vulnerability; CVSS (Common Vulnerability Scoring System) is the 0–10 severity score attached to it. SAST (Static Application Security Testing) scans your source; DAST scans the running app; SCA (Software Composition Analysis) scans your dependencies for known CVEs.

The two metrics that matter most are MTTD and mean time to patch. A vulnerability's danger is roughly proportional to how long it sits exposed — the window between "a fix exists" and "we deployed it" is your real risk surface. Time-to-patch is the security analog of DORA's time-to-restore, and elite security orgs measure it in hours for critical CVEs, not weeks.

Coverage metrics (% scanned) are the guardrail: an impressive "0 known vulnerabilities" is meaningless if only 20% of the code and none of the dependencies were scanned — you're not secure, you're just not looking.

Cost & efficiency (FinOps)

FinOps brings a financial lens to engineering: it makes cloud spend a first-class engineering metric that developers can see and influence, rather than a bill finance discovers at month-end. This is the category that ties every other metric back to a dollar figure, which is what makes it the bridge to business outcomes.

Metric	Definition	Why it matters	Tools
Cloud spend trend	Total infrastructure cost over time	The top-line; catch runaway growth early	AWS Cost Explorer, GCP/Azure billing (`platform`); CloudHealth, Vantage (`specialist`)
Unit economics	Cost per request / per user / per transaction	The one that scales — normalizes cost against usage	Vantage, CloudZero (`specialist`)
Utilization vs waste	% of provisioned resources actually used	Idle capacity and orphaned resources are pure waste	Kubecost, AWS Compute Optimizer (`specialist`/`platform`)
Cost per deploy	Infrastructure cost attributable to shipping	Ties delivery velocity to its real price	Cost-allocation tags in CloudHealth, Vantage (`specialist`)

The standout is unit economics. Absolute cloud spend should grow as you grow — that's not a problem. The question is whether cost per user (or per request) is flat, falling, or rising. Falling unit cost means you're scaling efficiently; rising unit cost means every new customer makes the economics worse, and no amount of growth fixes that. It's the FinOps equivalent of the LTV:CAC check.
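The flat, falling, or rising question can be answered from two columns of a billing export. This is a minimal sketch with hypothetical function names and made-up figures; the 5% tolerance for calling a trend "flat" is an arbitrary assumption, and comparing only the first and last month ignores seasonality.

```python
def cost_per_user(spend: float, active_users: int) -> float:
    return spend / active_users


def unit_cost_direction(monthly: list[tuple[float, int]], tolerance: float = 0.05) -> str:
    """Classify the unit-cost trend as 'falling', 'flat', or 'rising'.

    monthly holds (cloud_spend, active_users) pairs, oldest first.
    """
    first = cost_per_user(*monthly[0])
    last = cost_per_user(*monthly[-1])
    change = (last - first) / first
    if change < -tolerance:
        return "falling"
    if change > tolerance:
        return "rising"
    return "flat"


# Spend nearly doubles, but users grow faster: $10.00 per user down to $7.50.
healthy = unit_cost_direction([(10_000, 1_000), (14_000, 1_600), (18_000, 2_400)])
# Spend triples while users only double: $10.00 per user up to $15.00.
unhealthy = unit_cost_direction([(10_000, 1_000), (30_000, 2_000)])
```

Both examples show rising absolute spend; only the normalized figure separates efficient scaling from the case where each new customer makes the economics worse.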

This is also where a technical metric becomes a business argument: "our p99 latency dropped" is an engineering claim; "we cut cost-per-request 40% while holding p99" is a business-level result. Expressing efficiency work in unit-economic terms is how engineering reliability and performance improvements get funded.

Putting it together

You've now seen roughly a hundred metrics across a dozen categories. The mistake would be to track all of them. A dashboard with 100 numbers is a dashboard nobody reads; the skill is choosing a small, balanced set that tells the truth about your specific system.

A few principles pulled from everything above:

Pick a balanced few, not an exhaustive many. Cover a handful of dimensions — delivery, reliability, quality, developer experience, and a product/business outcome — with one or two metrics each. DORA's four keys plus an SLO plus a North Star is a better dashboard than fifty widgets.
Never optimize a single metric. Every metric in this post is dangerous alone. The most reliable way to make a system worse is to pick one number and push it hard.

Always pair a metric with its guardrail. This is the concrete defense against Goodhart, and it's the recurring theme of the whole tour:

Primary metric	Guardrail / counter-metric
Deployment frequency	Change failure rate
Velocity / throughput	Sprint predictability, quality
Code coverage	Mutation score
Cloud spend reduction	Reliability (SLO), latency
Feature shipping speed	Escaped-defect rate, DevEx

Tie everything to an outcome. For each metric, ask: if this moves, what decision do we make, and what does the customer or business get? If there's no answer, it's a vanity metric — stop tracking it.
Measure teams and systems, not individuals. Every metric that gets attached to a person's name and a bonus stops measuring the thing and starts measuring the incentive.

The whole point of measurement is better decisions, not better dashboards. A metric earns its place only if it changes what you do — everything else is decoration. Choose few, pair each with its counter, tie each to an outcome, and re-examine the set whenever it stops telling you something you didn't already know.
