
SRE – Site Reliability Engineering
Almost every site or SaaS product needs support and maintenance. When we have a Service Level Agreement (SLA) with our customers for the service we offer, we must monitor, maintain and support that service.
Why do we need an SLA? To ensure that our services perform well and to keep our customers happy. An SLA also ensures that our team and our customers are on the same page about what we offer and what they can expect.
How do we arrive at an SLA? From experience, or through better engineering practices?
Good engineering practices are the better basis. Google engineers pioneered or improved several of them.
These are:
SLA – Service Level Agreement – the specific levels of performance we promise to our customers
SLO – Service Level Objective – the internal goals we set and commit ourselves to, so that the SLA is met
SLI – Service Level Indicator – the indicators we use to measure service performance; they should be easy to measure, and there should be some provision to monitor and record them
Error Budget – the allowance for some failures or for feature releases
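To make the relationship between the three terms concrete, here is a minimal sketch in Python. It assumes availability is measured as the share of successful requests, which is only one possible SLI. The function names, the targets (99.9% for the SLA, 99.95% for the SLO) and the request counts are all hypothetical, chosen only for illustration.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the percentage of events that were served successfully."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100.0 * good_events / total_events


def meets_target(sli_percent: float, target_percent: float) -> bool:
    """True when the measured SLI is at or above the target."""
    return sli_percent >= target_percent


# Hypothetical figures for illustration only.
SLA_TARGET = 99.9    # what we promise to customers
SLO_TARGET = 99.95   # stricter internal goal, so the SLA has a safety margin

sli = availability_sli(good_events=999_700, total_events=1_000_000)
print(f"SLI: {sli:.2f}%")
print("SLO met:", meets_target(sli, SLO_TARGET))
print("SLA met:", meets_target(sli, SLA_TARGET))
```

Setting the SLO slightly tighter than the SLA means a missed internal goal acts as an early warning before the customer-facing promise is broken.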
Commitment to 100% Performance?
Does our SLA need to promise 100% performance, whether for uptime or any other measure? Not really. Our service depends on many other services and pieces of equipment, such as the internet connection, modem and power supply, none of which guarantee 100% performance, so promising 100% uptime is simply not realistic.
For this discussion, let us consider uptime, with commitments of 99.99% and 99.9%.
Arriving at Error Budget
With a 99.99% uptime commitment we have a 0.01% provision for failure; with a 99.9% commitment we have a 0.1% provision.
Let us convert these percentages into minutes of provision.
Let us take 4 weeks (28 days) as one small part of the support life cycle.
100% of that window is 28 x 24 x 60 = 40,320 minutes. A 99.99% uptime commitment leaves 0.01% of 40,320, about 4.03 minutes, for failure; a 99.9% commitment leaves 0.1% of 40,320, about 40.32 minutes.
Depending on our commitment, we have roughly 4 minutes or 40 minutes of room for failure in 4 weeks. That is our error budget.
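The arithmetic above can be checked with a short Python sketch. The function name and its parameters are hypothetical. The 28-day window and the two uptime targets come from the discussion above.

```python
def error_budget_minutes(target_percent: float, window_days: int = 28) -> float:
    """Minutes of allowed downtime in the window for a given uptime target."""
    if not 0.0 <= target_percent <= 100.0:
        raise ValueError("target_percent must be between 0 and 100")
    total_minutes = window_days * 24 * 60  # 28 days -> 40,320 minutes
    return total_minutes * (100.0 - target_percent) / 100.0


for target in (99.99, 99.9):
    budget = error_budget_minutes(target)
    print(f"{target}% uptime over 28 days -> {budget:.2f} minutes of error budget")
```

The same function works for other windows, for example `error_budget_minutes(99.9, window_days=30)` for a 30-day month.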
Based on the error budget, we can plan our response strategy: whether to invest in increasing availability or to look at alternatives.
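One way to turn the budget into a decision is to track how much of it has already been spent and choose a course of action from what remains. The sketch below is only an illustration. The function names, the 25% threshold and the three suggested actions are assumptions, not an established policy, and the incident durations are made-up numbers. The budget of 40.32 minutes is the 99.9% figure for a 28-day window.

```python
from typing import Iterable


def remaining_budget(budget_minutes: float, downtime_minutes: Iterable[float]) -> float:
    """Error budget left after subtracting the downtime recorded so far."""
    return budget_minutes - sum(downtime_minutes)


def suggested_action(remaining: float, budget_minutes: float) -> str:
    """Map the remaining budget to a coarse action (thresholds are hypothetical)."""
    if remaining <= 0:
        return "budget exhausted: pause feature releases, focus on availability"
    if remaining < 0.25 * budget_minutes:
        return "budget low: release with caution"
    return "budget healthy: feature releases can continue"


BUDGET = 40.32                # minutes, 99.9% uptime over 28 days
incidents = [12.0, 7.5, 15.0]  # hypothetical outage durations in minutes

left = remaining_budget(BUDGET, incidents)
print(f"Remaining budget: {left:.2f} minutes")
print(suggested_action(left, BUDGET))
```

With these sample incidents 34.5 of the 40.32 minutes are spent, so the sketch recommends caution. A real team would choose its own thresholds and actions.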
DevOps?
To stay within a budget of about 4 or 40 minutes of downtime in four weeks, our support team (the SRE team) must restore service quickly after a failure, which requires well-tested, automated procedures to find and fix the issue. Ideally, the service or software itself records failure details as close as possible to where the failure happens and raises alerts, so that the SRE team can assess them and trigger a response that resolves the failure and brings the service back.
SLAs on Customer Expectation
SLAs should be based on customer expectations and should commit to performance at the level of business functions, not at the technical level. For example, one business function may involve many APIs or components; the SLA should cover the performance of the business function as a whole and should not drill down to individual APIs or components.
– Kuppuram, CTO, We3cares