SRE (Site Reliability Engineering) Notes
- SLIs / SLOs / Error Budgets - Measure what matters (e.g., latency, availability) and use "error budgets" to balance innovation vs. reliability: push features only while error budget remains.
- Commit to clear promises that set service objectives, expectations, and levels.
- Assess those promises continuously, with metrics and budgetary limits.
- Toil Reduction: Automate repetitive ops work; aim for <50% of team time on manual tasks.
- Production Practices: Incident response, postmortems, and capacity planning as engineering disciplines.
- React quickly to keep and repair promises, be on-call, and guard autonomy to avoid new gatekeepers.
- On-Call and Automation: SREs code their way out of ops; leverage your infra skills for chaos engineering or canarying.
SRE principles - 1st book
embracing risk (Chapter 3)
>> Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.
>> when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.
>> In a sense, we view the availability target as both a minimum and a maximum. The key advantage of this framing is that it unlocks explicit, thoughtful risk-taking.
>> How can we use the service cost to help locate a service on the risk continuum?
>> Risk Tolerance of Services
- Identify the Risk Tolerance of Consumer Services
- Target level of availability
- What level of service will the users expect?
- Does this service tie directly to revenue (either our revenue, or our customers’ revenue)?
- Is this a paid service, or is it free?
- If there are competitors in the marketplace, what level of service do those competitors provide?
- Is this service targeted at consumers, or at enterprises?
- Cost
- If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be?
- e.g. availability target: 99.9% → 99.99%, increase in availability: 0.09%, revenue: $1M, Value of improved availability: $1M * 0.0009 = $900
- If there is no simple translation function between reliability and revenue, one strategy may be to consider the background error rate (e.g., 0.1% packet loss): there is no value in driving the service's error rate below that background rate.
- Does this additional revenue offset the cost of reaching that level of reliability?
- Other service metrics
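The cost question above (one more nine, $1M revenue) can be worked through as a small calculation. This is only a sketch: the function names and the `cost_of_extra_nine` figure are hypothetical, and it assumes revenue scales linearly with availability, which the notes themselves flag as not always true.

```python
def incremental_value(current: float, proposed: float, revenue: float) -> float:
    """Revenue attributable to raising availability from `current` to `proposed`.

    Assumes revenue is proportional to availability (a simplification).
    """
    return revenue * (proposed - current)


def offsets_cost(value: float, cost: float) -> bool:
    """True when the added revenue is at least the cost of the extra reliability."""
    return value >= cost


# Figures from the notes: 99.9% -> 99.99% on $1M of revenue.
value = incremental_value(0.999, 0.9999, 1_000_000)
print(f"value of improved availability: ${value:.2f}")  # $900.00

# Hypothetical cost of reaching the extra nine, for illustration only.
cost_of_extra_nine = 5_000
print("worth it:", offsets_cost(value, cost_of_extra_nine))  # False
```

With these numbers the extra nine only pays for itself if it costs less than about $900.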
- Identifying the Risk Tolerance of Infrastructure Services
- by definition, have multiple clients, often with varying needs
- Target level of availability
Different use cases of a service >> Risk tolerance for these two use cases is quite distinct.
- Types of failures
- Different use cases
- Cost
- Different use cases: it may make sense to build separate instances of the service to satisfy different requirements, e.g., throughput vs. latency.
>> Note that we can run multiple classes of services using identical hardware and software. We can provide vastly different service guarantees by adjusting a variety of service characteristics, such as the quantities of resources, the degree of redundancy, the geographical provisioning constraints, and, critically, the infrastructure software configuration.
embracing risk -> Error Budgets
- Tension between product development teams and SRE teams - evaluated on different metrics
- Product development pushes for velocity, and SRE for reliability, pushing back on high rate of change
>> "Hope is not a strategy."
- Tensions
- Software fault tolerance
- Testing
- Push frequency
- Canary duration and size
- Goal - define an objective metric, agreed upon by both sides, to guide the negotiations in a reproducible way.
- two teams define quarterly error budget
- Forming Your Error Budget based on the service's SLO - clear objective that determines how unreliable the service is allowed to be within a single quarter.
- Error Budget practice
- Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
- The actual uptime is measured by a neutral third party: our monitoring system.
- The difference between these two numbers is the "budget" of how much "unreliability" is remaining for the quarter.
- As long as the uptime measured is above the SLO (in other words, as long as there is error budget remaining), new releases can be pushed.
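The four steps of the error budget practice can be sketched as a small data structure. This is a minimal illustration, not how any particular monitoring system does it: the `ErrorBudget` class and its field names are hypothetical, and both values are assumed to be uptime fractions over the same quarter.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float              # uptime target set by product management, e.g. 0.999
    measured_uptime: float  # uptime reported by the monitoring system

    @property
    def total(self) -> float:
        """Unreliability the SLO allows for the quarter."""
        return 1.0 - self.slo

    @property
    def remaining(self) -> float:
        """Difference between measured uptime and the SLO."""
        return self.measured_uptime - self.slo

    def can_release(self) -> bool:
        """Releases proceed only while measured uptime is above the SLO."""
        return self.remaining > 0


healthy = ErrorBudget(slo=0.999, measured_uptime=0.9995)
exhausted = ErrorBudget(slo=0.999, measured_uptime=0.9985)
print(healthy.can_release())    # True
print(exhausted.can_release())  # False
```

Keeping the check this simple is the point of the practice: both teams can read the same two numbers and reach the same release decision.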
- Error Budget benefits
- focus on finding the right balance between innovation and reliability.
- If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient, improve its performance, and so on.
- In effect, the product development team becomes self-policing.
- They know the budget and can manage their own risk.
- (Of course, this outcome relies on an SRE team having the authority to actually stop launches if the SLO is broken.)
- What happens if a network outage or datacenter failure reduces the measured SLO? Such events also eat into the error budget.
- The entire team supports this reduction because everyone shares the responsibility for uptime.
- The budget also helps to highlight some of the costs of overly high reliability targets, in terms of both inflexibility and slow innovation.
- If the team is having trouble launching new features, they may elect to loosen the SLO (thus increasing the error budget) in order to increase innovation.
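To make the effect of loosening an SLO concrete, the budget can be expressed as allowed downtime per quarter. A rough sketch, assuming a 90-day quarter and a time-based availability measure (both are assumptions, not from the notes); `allowed_downtime_minutes` is a hypothetical helper.

```python
QUARTER_MINUTES = 90 * 24 * 60  # assumed 90-day quarter = 129,600 minutes


def allowed_downtime_minutes(slo: float, period_minutes: int = QUARTER_MINUTES) -> float:
    """Downtime the SLO permits over the period, in minutes."""
    return (1.0 - slo) * period_minutes


for slo in (0.9999, 0.999):
    print(f"SLO {slo:.2%}: {allowed_downtime_minutes(slo):.2f} minutes per quarter")
# SLO 99.99%: 12.96 minutes per quarter
# SLO 99.90%: 129.60 minutes per quarter
```

Under these assumptions, dropping one nine gives the team roughly ten times as much room for risky launches.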
- Key Insights
- Managing service reliability is largely about managing risk, and managing risk can be costly.
- 100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take.
- An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders, and allow multiple teams to reach the same conclusion about production risk without rancor.
- service level objectives (Chapter 4)
- eliminating toil (Chapter 5)
DevOps (Chapter 6)
Reading
- Book "Thinking in Promises"
>> The goal of Promise Theory is to reveal the behavior of a whole from the sum of its parts, taking the viewpoint of the parts rather than the whole.
>> A conditional promise cannot be assessed unless the assessor also sees that the condition itself is promised.
>> A conditional promise is not a promise unless the condition itself is also promised.
>> Promise Theory makes a simple prediction about services, which is possibly counterintuitive. It tells us that the responsibility for getting service ultimately lies with the client, not the server.
