Metrics That Matter
Communications of the ACM, April 2019, Vol. 62 No. 4, Page 88
Practice: "Metrics That Matter"
By Benjamin Treynor Sloss, Shylaja Nukala, Vivek Rau
Site reliability engineering, or SRE, is a software-engineering specialization that focuses on the reliability and maintainability of large systems. In its experience in the field, Google has found some critical but oft-neglected metrics that are important for running reliable services.
This article, based on Ben Treynor's talk at the Google Cloud Next 2017 conference, addresses those metrics, specifically for product development and SRE teams, managers of such teams, and anyone else who cares about the reliability of Web products or infrastructure. To further explain its approach to product reliability, Google has published Site Reliability Engineering: How Google Runs Production Systems (hereafter referred to as the SRE book) and The Site Reliability Workbook: Practical Ways to Implement SRE (hereafter referred to as the SRE workbook).
One of the most important choices in offering a service is which service metrics to measure, and how to evaluate them. The difference between great, good, and poor metric and metric threshold choices is frequently the difference between a service that will surprise and delight its users with how well it works, one that will be acceptable for most users, and one that will actively drive away users—regardless of what the service actually offers.
For example, it is not uncommon to measure the QPS (queries per second) received at a Web or API server, and to assess that this metric indicates good service health if the graph of the metric over time has a smooth sinusoidal diurnal curve with no unexpected spikes or troughs, and the peaks of the curve are rising over time, indicating user growth. Yet this is a poor metric choice—at best it will provide the operator with a lagging indicator of large-scale problems. It misses a host of real, common problems, including partial unreachability, error rates in the 0.1%–3% range, high latency, and intervals of bad results.
These problems lead to unhappy users and service abandonment—yet throughout it all, the QPS Received graph continues to show its happy sinusoidal curves and to provide a soothing sense that all is well. The best that can be said about the QPS Received metric is that it's relatively simple to implement—and even that is a problem, because it is often implemented early and thus takes the place of more sophisticated and useful metrics that would provide an operator with more accurate and useful data about the service.
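To make the blind spot concrete, here is a minimal sketch, not a description of any Google system. It assumes a hypothetical service whose traffic follows a smooth sinusoidal daily curve. The function names, the traffic levels, and the 2% failure rate are all invented for illustration. The sketch totals one day of requests twice, once with no failures and once with 2% of requests failing, and compares what a QPS Received view and a success-ratio view would each report.

```python
import math

MINUTES_PER_DAY = 1440


def qps_received(minute_of_day, peak=1000.0, trough=200.0):
    """Hypothetical smooth diurnal request rate, in queries per second."""
    midpoint = (peak + trough) / 2
    amplitude = (peak - trough) / 2
    return midpoint + amplitude * math.sin(2 * math.pi * minute_of_day / MINUTES_PER_DAY)


def summarize_day(error_fraction):
    """Total one day of traffic, assuming a constant fraction of requests fail."""
    received = 0.0
    failed = 0.0
    for minute in range(MINUTES_PER_DAY):
        requests = qps_received(minute) * 60  # requests arriving in this minute
        received += requests
        failed += requests * error_fraction
    return {
        "received": received,
        "failed": failed,
        "success_ratio": (received - failed) / received,
    }


healthy = summarize_day(error_fraction=0.0)
degraded = summarize_day(error_fraction=0.02)  # assumed 2% of requests fail

# The received-traffic view cannot tell the two days apart.
print("received identical:", healthy["received"] == degraded["received"])
# The success ratio separates them immediately.
print("success ratio, healthy day: ", healthy["success_ratio"])
print("success ratio, degraded day:", degraded["success_ratio"])
```

In this toy model the received totals are identical on both days, because a failed request still counts as a received request. Only a metric that looks at outcomes, such as the success ratio here, distinguishes the degraded day from the healthy one.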
What follows are the types of metrics the Google SRE team has adopted for Google services. These metrics are not particularly easy to implement, and they may require changes to a service to instrument properly. It has been our consistent experience at Google, however, that every service team that implements these metrics is glad afterward that it made the effort. The metrics investment is small compared with the overall effort to build and launch the service in the first place, and the prompt payback in user satisfaction and usage growth is outsized relative to the effort required. We believe you will find this is true for your service, too.
About the Authors:
Benjamin Treynor Sloss started programming at age 6 and joined Oracle as a software engineer at 17. He has also worked at Versant, E.piphany, SEVEN, and (currently) Google. His team of approximately 4,700 is responsible for site reliability engineering, networking, and datacenters worldwide.
Shylaja Nukala is a technical writing lead for Google Site Reliability Engineering. She leads the documentation, information management, and select-training efforts for SRE, Cloud, and Google engineers.
Vivek Rau is a site reliability engineer at Google, working on customer reliability engineering (CRE). The CRE team teaches customers core SRE principles, enabling them to build and operate highly reliable products on the Google Cloud Platform.