Managing services is hard for both service owners and stakeholders. To make things easier for everyone, define a clear set of expectations from the beginning. This makes it easier to measure and evaluate the health of services.
In this context, SLAs (Service Level Agreements) are likely familiar. An SLA is a written agreement between the client and the service provider to ensure a healthy level of quality. If the specified conditions aren’t met, there are consequences, and they are often financial.
However, the real world isn’t this simple. Service owners are accountable to both external and internal stakeholders, who depend on the services to meet their business objectives. This is especially common in microservices architectures, where one service depends on another. As it doesn’t make sense to have written contracts for everything, service owners should instead be held responsible through clearly defined objectives: SLOs (Service Level Objectives). There are no severe penalties if those objectives aren’t met. Yet, this doesn’t mean they are there for nothing. Missing them still has consequences, or rather corrective actions, needed to improve those services.
A simple equation to define the relationship between an SLA and an SLO is:
SLA = SLO + written and signed consequences
Another important term to be familiar with is SLI (Service Level Indicator). SLIs are the metrics used to evaluate SLOs. Now that the importance of, and the differences between, SLAs, SLOs, and SLIs have been identified, let’s focus on five key steps in measuring and evaluating SLOs.
Set the right objectives
Setting the right objectives is the first important step towards building proper SLOs. There are some important things to consider at this point:
Identify key metrics (service level indicators, or SLIs) from the end-user viewpoint, such as latency.
Make them measurable, such as 100 ms latency.
Allow some slack (an error budget), such as 100 ms latency 99.9% of the time.
Be clear on what you promise, for example: 99.9% of the time (averaged over 10 minutes), HTTP calls complete in under 100 ms.
Consider product and business implications, because setting the right objectives for SLOs isn’t purely technical, as stated in the SRE Book.
Although these points are important and seem obvious, it is really hard to identify the right metrics. Talk openly with users and be clear on what is promised.
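To make the example promise above concrete, here is a minimal Python sketch that checks it against a batch of request latencies. The `LatencySLO` class and the `evaluate` function are hypothetical names made up for illustration, and treating a period with no traffic as compliant is an assumption rather than a rule from the SRE Book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical container for a latency objective."""
    threshold_ms: float  # a request is "good" if it completes under this latency
    target: float        # required fraction of good requests, e.g. 0.999


def evaluate(slo, latencies_ms):
    """Compute the SLI and the remaining error budget for one batch of requests."""
    total = len(latencies_ms)
    bad = sum(1 for latency in latencies_ms if latency >= slo.threshold_ms)
    # Assumption: with no traffic there is nothing to violate.
    sli = 1.0 if total == 0 else (total - bad) / total
    allowed_bad = total * (1 - slo.target)  # error budget, counted in requests
    return {
        "sli": sli,
        "met": sli >= slo.target,
        "budget_remaining": allowed_bad - bad,
    }


slo = LatencySLO(threshold_ms=100.0, target=0.999)
# 10,000 requests, 8 of which took 250 ms.
report = evaluate(slo, [40.0] * 9_992 + [250.0] * 8)
print(report["met"], round(report["budget_remaining"]))
```

With 8 slow requests out of 10,000, the objective is met and roughly 2 requests of error budget are left for the period.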
Collect monitoring data
Once important metrics have been identified, they need to be collected. This stage depends heavily on the SLOs and on what the service means to others. Different things may need to be monitored depending on the level of abstraction. Often what is needed is a monitoring tool like Datadog to collect and visualize the data. These tools allow for aggregation, and for alerting when a metric reaches a defined threshold.
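The aggregation step can be sketched in plain Python, independent of any particular monitoring tool. The example below assumes latency samples arrive as `(unix_timestamp, latency_ms)` pairs, groups them into 10-minute windows, and reports the windows in which fewer than 99.9% of calls finished in under 100 ms. The function name and default values are illustrative assumptions, not part of any vendor's API.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10-minute aggregation window


def windows_breaching(samples, threshold_ms=100.0, target=0.999):
    """Return the start times of windows whose good-request fraction is below target.

    samples: iterable of (unix_timestamp_seconds, latency_ms) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # window start -> [good, total]
    for timestamp, latency_ms in samples:
        start = int(timestamp) // WINDOW_SECONDS * WINDOW_SECONDS
        counts[start][1] += 1
        if latency_ms < threshold_ms:
            counts[start][0] += 1
    return sorted(
        start for start, (good, total) in counts.items() if good / total < target
    )


# First window: 1,000 fast calls. Second window: 2 slow calls out of 1,000.
samples = [(i % 600, 50.0) for i in range(1000)]
samples += [(600 + i % 600, 250.0 if i < 2 else 50.0) for i in range(1000)]
print(windows_breaching(samples))
```

A real monitoring tool would evaluate such a condition continuously. The sketch only shows what "99.9% of the time, averaged over 10 minutes" means in terms of raw samples.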
Alert on collected metrics
Alerting is a critical and complex job by itself. Filtering out low-priority alerts and letting the team know about the ones that matter are both important for the health of on-call. But these are not the only places where an incident management solution such as Opsgenie helps. A proper incident management tool does a lot more than that: it centralizes all alerts from different monitoring tools in one dashboard and allows users to categorize important alerts for later analysis.
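As a rough illustration of filtering and categorizing, the following Python sketch splits alerts into those that should notify on-call immediately and those kept for later review, and counts alerts per monitoring source. The `Alert` fields, the 1-to-5 priority scale, and the cut-off of 2 are assumptions chosen for the example. This is not how Opsgenie or any other tool implements routing.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str    # monitoring tool that raised the alert
    service: str   # affected service
    priority: int  # hypothetical scale: 1 = most urgent, 5 = least urgent


def triage(alerts, notify_up_to=2):
    """Split alerts into (notify on-call now, keep for later review)."""
    notify = [a for a in alerts if a.priority <= notify_up_to]
    review = [a for a in alerts if a.priority > notify_up_to]
    return notify, review


def alerts_per_source(alerts):
    """Count alerts by the monitoring tool that reported them."""
    return Counter(a.source for a in alerts)


alerts = [
    Alert(source="metrics", service="checkout", priority=1),
    Alert(source="logs", service="checkout", priority=4),
    Alert(source="metrics", service="search", priority=3),
]
notify, review = triage(alerts)
print([a.service for a in notify], len(review), alerts_per_source(alerts))
```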
Create reports from alerts
Once all of the alerts are in one place, it’s important to set up alert reporting, which makes it easy to see important data points in a structured view. To report on SLOs, Opsgenie uses Service and Infrastructure Health Reports, which include key indicators that a team can use to evaluate metrics and share with customers. Examples of these metrics are: mean time to resolve and close incidents per service; service health percentage (healthy/unhealthy state by outages and disruptions); the severity of incidents that arise in a service, along with the alerts associated with each incident (which gives insight into which monitoring systems reported the incident, and how); and how stakeholders were affected by service disruptions, including whether they were notified in a timely and proper way. The Infrastructure Health Reports provide infrastructure-wide context by allowing stakeholders to see all alerts and incidents across an entire infrastructure in a single view.
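Two of the metrics mentioned above, mean time to resolve per service and service health percentage, can be derived from a list of incidents. The Python sketch below is one possible way to compute them. The `Incident` record is hypothetical, and the health calculation assumes that incidents for the same service do not overlap in time. It is not a description of how the Opsgenie reports are computed.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    service: str
    opened: datetime
    resolved: datetime


def mean_time_to_resolve(incidents):
    """Average resolution time per service, as a timedelta."""
    durations = defaultdict(list)
    for incident in incidents:
        durations[incident.service].append(incident.resolved - incident.opened)
    return {svc: sum(ds, timedelta()) / len(ds) for svc, ds in durations.items()}


def health_percentage(incidents, service, period_start, period_end):
    """Share of the period the service was healthy (assumes no overlapping incidents)."""
    unhealthy = timedelta()
    for incident in incidents:
        if incident.service != service:
            continue
        # Clip each incident to the reporting period.
        start = max(incident.opened, period_start)
        end = min(incident.resolved, period_end)
        if end > start:
            unhealthy += end - start
    return 100.0 * (1 - unhealthy / (period_end - period_start))


t0 = datetime(2024, 1, 1)
incidents = [
    Incident("checkout", t0 + timedelta(days=2), t0 + timedelta(days=2, minutes=30)),
    Incident("checkout", t0 + timedelta(days=9), t0 + timedelta(days=9, minutes=90)),
]
mttr = mean_time_to_resolve(incidents)
health = health_percentage(incidents, "checkout", t0, t0 + timedelta(days=30))
print(mttr["checkout"], round(health, 2))
```

In this example the two incidents last 30 and 90 minutes, so the mean time to resolve is one hour, and 120 unhealthy minutes in a 30-day period give a health percentage of about 99.72%.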
Evaluate and share the reports
Reports mean nothing if left unevaluated. They are the written proof of performance on the service level indicators defined internally, and they help show whether SLOs were met. Evaluation should include every team member and stakeholder. This means transparency is crucial: be open about the reports and share the results with others. To dig a little deeper with analytics tools or create more sophisticated reports for stakeholders, export the reports for easy sharing.
Once the cycle is completed, from creating the objectives to evaluating the reports, the job still isn’t done. It starts all over again. Reevaluate objectives and take corrective actions, either by refining the indicators or by making services more robust. Examine error budgets closely to make sure that overachievement is avoided (yes, that is bad too). It is important to design objectives taking into account that tools and services will fail, because they will.
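An error budget review at the end of a cycle can be as simple as comparing the budget consumed with what was allowed. The Python sketch below labels a period as breached, on track, or overachieving. The function name and the 10% cut-off for "overachieving" are arbitrary assumptions for illustration, and each team would pick its own thresholds.

```python
def review_error_budget(target, good, total, overachievement_ratio=0.1):
    """Classify a reporting period by how much of its error budget was consumed.

    target: required fraction of good events, e.g. 0.999
    good, total: counts of good and all events in the period
    overachievement_ratio: hypothetical cut-off; consuming less than this share
        of the budget is treated as overachievement
    """
    if total == 0:
        return "no data"
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_bad = total * (1 - target)
    consumed = (total - good) / allowed_bad  # 1.0 means the budget is exactly spent
    if consumed > 1:
        return "breached"
    if consumed < overachievement_ratio:
        return "overachieving"
    return "on track"


for good in (9_985, 9_994, 10_000):
    print(good, review_error_budget(0.999, good, 10_000))
```

A "breached" result suggests making the service more robust, while a run of "overachieving" results suggests that the objective is looser than what users already experience and rely on.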
If you would like additional information, Google’s SRE book is the definitive source for these concepts. Check out this article published by a Senior Site Reliability Engineer at Google. Another good resource is this blog on Monitoring services and setting SLAs with Datadog.