You can’t transform something you don’t understand. If you don’t understand the current state of the customer experience, how can you possibly design the desired future state?
While there is no single set of metrics that every business needs to drive observability operations, every organization should track four basic categories of metrics to achieve observability.
Some, like application and infrastructure metrics, are obvious to most teams.
The most basic types of observability metrics are those associated with approaches like the RED Method or Google’s Four Golden Signals.
They boil down to tracking performance at the application level by monitoring:
Rate
The number of requests your application serves per second.
Errors
The number of those requests that fail.
Duration
The amount of time each request takes to complete.
When used for monitoring, these metrics help identify issues like an application that has become slow to respond or that is generating a higher-than-usual volume of errors.
In the context of observability, however, these metrics can be taken further to provide deeper insight into performance issues.
For instance, if you can correlate a spike in errors with an increase in request rates, it’s more likely than not that the application is generating errors because it is receiving more requests than it can handle. The solution in that case would probably be to spin up more instances of the application.
Real User Monitoring (RUM) is a capability that helps observers understand how real users interact and experience a digital interface (web application, mobile application) and whether or not their experience is satisfactory. It is often used as a starting point for problem detection and diagnosis.
The Essential Guide to Observability
You can gain further observability insights by tracking infrastructure metrics. The exact metrics to work with here depend on how your application is hosted – whether it runs in the cloud or on-premises, for example, and whether it’s orchestrated via Kubernetes. But in general, you’ll want to track:
CPU utilization
Memory usage
Disk capacity and I/O
Network throughput
In a distributed infrastructure like a Kubernetes cluster, you should also track the total number of nodes and changes in node state in order to ensure that you get ahead of issues such as a lack of available nodes.
To apply these metrics to observability, you should correlate them with other data points. An exhaustion of CPU and memory resources that occurs at the same time as an increase in application error rates may mean that the application is dropping requests due to lack of infrastructure resources, for example.
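To make that correlation concrete, here is a minimal sketch that flags the minutes where CPU saturation and elevated error rates overlap. The thresholds and sample data are illustrative assumptions:

```python
# Sketch: flagging intervals where resource exhaustion coincides with
# application errors. Thresholds and sample data are illustrative.

cpu_pct = [45, 50, 97, 98, 99, 55, 48]            # per-minute CPU utilization
error_rate = [0.1, 0.2, 4.0, 5.5, 6.1, 0.3, 0.1]  # errors per 100 requests

CPU_SATURATED = 90     # percent
ERRORS_ELEVATED = 1.0  # errors per 100 requests

# Collect the minutes where both signals are abnormal at once.
suspect = [
    i for i, (cpu, err) in enumerate(zip(cpu_pct, error_rate))
    if cpu >= CPU_SATURATED and err >= ERRORS_ELEVATED
]

# Overlapping windows suggest requests are being dropped for lack of
# infrastructure resources rather than because of an application bug.
print("suspect minutes:", suspect)
```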
Delivering quality software quickly is a key predictor of an organization’s productivity, profitability, and customer satisfaction. However, siloed tool data blurs the view of how your organization is operating from a software delivery and DevOps perspective.
Observability is the backbone of your continuous integration (CI) pipelines, ensuring the health, performance, and reliability of your applications at each phase of the software delivery process. The 2021 DORA State of DevOps Report highlights five metrics that are key indicators of organizational software delivery and operational performance:
Deployment frequency
How often does your organization deploy code to production?
Lead time for changes
The time to go from code committed to code running in production.
Time to restore service
The time to restore service when an incident or defect occurs.
Change failure rate
A percentage of changes to production resulting in degraded service.
Reliability
Degree to which the software operates reliably—measured by SLIs, SLOs, and error budgets.
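Several of these metrics can be derived directly from deployment records. A minimal sketch, assuming a simple record format with commit time, deploy time, and a degraded-service flag:

```python
# Sketch: deriving three DORA metrics from deployment records.
# The record format and data are assumptions for illustration.
from datetime import datetime, timedelta

deployments = [
    # (commit time, deploy time, caused degraded service?)
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 15, 0), False),
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 3, 11, 0), True),
    (datetime(2024, 5, 4, 8, 0), datetime(2024, 5, 4, 12, 0), False),
    (datetime(2024, 5, 5, 9, 30), datetime(2024, 5, 5, 13, 30), False),
]

days_observed = 7

# Deployment frequency: deploys per day over the observation window.
deploy_frequency = len(deployments) / days_observed

# Lead time for changes: mean commit-to-production time.
lead_times = [deployed - committed for committed, deployed, _ in deployments]
mean_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change failure rate: share of deploys that degraded service.
change_failure_rate = sum(failed for *_, failed in deployments) / len(deployments)

print(f"deploys/day: {deploy_frequency:.2f}")
print(f"mean lead time: {mean_lead_time}")
print(f"change failure rate: {change_failure_rate:.0%}")
```

Time to restore service would be computed the same way from incident open/close timestamps.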
Finally, consider tracking business metrics, meaning metrics that align technical goals with business goals. Examples include:
Additionally, metrics such as SLAs, SLIs, and SLOs, while critical in software delivery, are also important to the business because they represent the promise you make to users about the reliability of your service. Will your application be accessible when they need it? Is it performing as expected? These metrics help assess the direct business impact of observability efforts.
Service Level Agreement
The formal agreement between you and your customers about the performance of your service.
Service Level Objective
The target measurements that define your service’s performance in support of the SLA, e.g., availability or mean time to respond.
Service Level Indicator
How your service is actually performing, i.e., whether you are keeping the promise to your customers outlined in your SLAs.
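As an illustration of how these fit together, here is a minimal sketch that computes an availability SLI and the remaining error budget against a 99.9% SLO. The request counts are made up for the example:

```python
# Sketch: computing an availability SLI and the remaining error budget
# against an SLO. The request counts are illustrative.

slo_target = 0.999        # SLO: 99.9% of requests must succeed
total_requests = 1_000_000
failed_requests = 420

# SLI: how the service actually performed over the window.
sli = (total_requests - failed_requests) / total_requests

# Error budget: the failures the SLO permits, minus failures already spent.
budget_total = (1 - slo_target) * total_requests  # ~1,000 failures allowed
budget_remaining = budget_total - failed_requests

print(f"SLI: {sli:.4%}")
print(f"error budget remaining: {budget_remaining:.0f} requests")
```

If the remaining budget trends toward zero, the SLA is at risk and teams would typically slow feature releases in favor of reliability work.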
Make sure all stakeholders become “observers” who can access software that helps them track observability metrics. Avoid allowing any one team to “own” a particular tool or the data it generates. When observability and the insights it delivers are shared across the team, it becomes much easier to establish a feeling of shared responsibility: people are collectively invested in understanding why something is not working, rather than simply knowing that it is not working. As this mentality pushes further left, teams begin to design, build, release, and resolve collectively, spurring greater efficiency, reliability, and even innovation.
Teams can identify helpful practices that foster information flow and trust by examining the six aspects of Westrum's model of organizational culture, focusing on those behaviors seen in the generative culture: