You can’t transform something you don’t understand. If you don’t understand the current state of the customer experience, how can you possibly design the desired future state?
While there is no single set of metrics that every business needs to drive observability operations, every organization should track four basic categories of metrics to achieve observability.
Some, like application and infrastructure metrics, are obvious to most teams.
The most basic types of observability metrics are those associated with approaches like the RED Method or Google’s Four Golden Signals.
They boil down to tracking performance at the application level by monitoring:
Rate
The number of requests your application serves per second.
Errors
The number of those requests that fail.
Duration
The amount of time each request takes to complete.
When used for monitoring, these metrics help identify issues like an application that has become slow to respond or that is generating a higher-than-usual volume of errors.
In the context of observability, however, these metrics can be taken further to provide deeper insight into performance issues.
For instance, if you can correlate a spike in errors with an increase in request rates, it’s more likely than not that the application is generating errors because it is receiving more requests than it can handle. The solution in that case would probably be to spin up more instances of the application.
Real User Monitoring (RUM) is a capability that helps observers understand how real users interact and experience a digital interface (web application, mobile application) and whether or not their experience is satisfactory. It is often used as a starting point for problem detection and diagnosis.
The Essential Guide to Observability
You can gain further observability insights by tracking infrastructure metrics. The exact metrics to work with here depend on how your application is hosted – whether it runs in the cloud or on-premises, for example, and whether it’s orchestrated via Kubernetes. But in general, you’ll want to track:
CPU utilization
Memory usage
Disk capacity and I/O
Network throughput
In a distributed infrastructure like a Kubernetes cluster, you should also track the total number of nodes and changes in node state in order to ensure that you get ahead of issues such as a lack of available nodes.
To apply these metrics to observability, you should correlate them with other data points. An exhaustion of CPU and memory resources that occurs at the same time as an increase in application error rates may mean that the application is dropping requests due to lack of infrastructure resources, for example.
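To make that correlation concrete, here is a minimal sketch that flags the minutes where CPU saturation and elevated error rates overlap. The thresholds and sample data are illustrative assumptions:

```python
# Sketch: flagging intervals where resource exhaustion coincides with
# application errors. Thresholds and sample data are illustrative.

cpu_pct = [45, 50, 97, 98, 99, 55, 48]            # per-minute CPU utilization
error_rate = [0.1, 0.2, 4.0, 5.5, 6.1, 0.3, 0.1]  # errors per 100 requests

CPU_SATURATED = 90     # percent
ERRORS_ELEVATED = 1.0  # errors per 100 requests

# Collect the minutes where both signals are abnormal at once.
suspect = [
    i for i, (cpu, err) in enumerate(zip(cpu_pct, error_rate))
    if cpu >= CPU_SATURATED and err >= ERRORS_ELEVATED
]

# Overlapping windows suggest requests are being dropped for lack of
# infrastructure resources rather than because of an application bug.
print("suspect minutes:", suspect)
```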
Delivering quality software quickly is a key predictor of an organization’s productivity, profitability, and customer satisfaction. However, siloed tool data blurs the view of how your organization is operating from a software delivery and DevOps perspective.
Observability is the backbone of your continuous integration (CI) pipelines, ensuring the health, performance, and reliability of your applications at each phase of the software delivery process. The 2021 DORA State of DevOps Report highlights five metrics that are key indicators of organizational software delivery and operational performance:
Deployment frequency
How often does your organization deploy code to production?
Lead time for changes
The time to go from code committed to code running in production.
Time to restore service
The time to restore service when an incident or defect occurs.
Change failure rate
A percentage of changes to production resulting in degraded service.
Reliability
Degree to which the software operates reliably—measured by SLIs, SLOs, and error budgets.
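Several of these metrics can be derived directly from deployment records. A minimal sketch, assuming a simple record format with commit time, deploy time, and a degraded-service flag:

```python
# Sketch: deriving three DORA metrics from deployment records.
# The record format and data are assumptions for illustration.
from datetime import datetime, timedelta

deployments = [
    # (commit time, deploy time, caused degraded service?)
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 15, 0), False),
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 3, 11, 0), True),
    (datetime(2024, 5, 4, 8, 0), datetime(2024, 5, 4, 12, 0), False),
    (datetime(2024, 5, 5, 9, 30), datetime(2024, 5, 5, 13, 30), False),
]

days_observed = 7

# Deployment frequency: deploys per day over the observation window.
deploy_frequency = len(deployments) / days_observed

# Lead time for changes: mean commit-to-production time.
lead_times = [deployed - committed for committed, deployed, _ in deployments]
mean_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change failure rate: share of deploys that degraded service.
change_failure_rate = sum(failed for *_, failed in deployments) / len(deployments)

print(f"deploys/day: {deploy_frequency:.2f}")
print(f"mean lead time: {mean_lead_time}")
print(f"change failure rate: {change_failure_rate:.0%}")
```

Time to restore service would be computed the same way from incident open/close timestamps.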
Finally, consider tracking business metrics, meaning metrics that align technical goals with business goals. Examples include:
Additionally, metrics such as SLAs, SLIs, and SLOs, while critical in software delivery, are also important to the business because they represent the promise you make to users about the reliability of your service. Will your application be accessible when they need it? Is it performing as expected? These metrics help assess the direct business impact of observability efforts.
Service Level Agreement
The formal agreement between you and your customers about the performance of your service.
Service Level Objective
The target measurements that define your service’s performance in support of the SLA, e.g., availability or mean time to respond.
Service Level Indicator
How your service is actually performing, i.e., whether you are keeping the promise to your customers outlined in your SLAs.
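As an illustration of how these fit together, here is a minimal sketch that computes an availability SLI and the remaining error budget against a 99.9% SLO. The request counts are made up for the example:

```python
# Sketch: computing an availability SLI and the remaining error budget
# against an SLO. The request counts are illustrative.

slo_target = 0.999        # SLO: 99.9% of requests must succeed
total_requests = 1_000_000
failed_requests = 420

# SLI: how the service actually performed over the window.
sli = (total_requests - failed_requests) / total_requests

# Error budget: the failures the SLO permits, minus failures already spent.
budget_total = (1 - slo_target) * total_requests  # ~1,000 failures allowed
budget_remaining = budget_total - failed_requests

print(f"SLI: {sli:.4%}")
print(f"error budget remaining: {budget_remaining:.0f} requests")
```

If the remaining budget trends toward zero, the SLA is at risk and teams would typically slow feature releases in favor of reliability work.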
Make sure all stakeholders become “observers” who can access software that helps them track observability metrics. Avoid allowing any one team to “own” a particular tool or the data it generates. When observability and the insights it delivers are shared across the team, it becomes much easier to establish a feeling of shared responsibility: people are collectively invested in understanding why something is not working, rather than simply knowing that it is not working. As this mentality pushes further left, teams begin to design, build, release, and resolve collectively, spurring greater efficiency, reliability, and even innovation.
Teams can identify helpful practices that foster information flow and trust by examining the six aspects of Westrum's model of organizational culture, focusing on those behaviors seen in the generative culture: