May 16, 2018
When Google published its Site Reliability Engineering (SRE) book — a detailed look at how it keeps production systems running — Forrester started getting a lot of questions. “Should I do this in my enterprise IT shop?” “I’m no unicorn — can I even do these things?” And perhaps most important: “What parts of the book are relevant?”
To answer these, we broke SRE down into 24 principles spread across six categories: service delivery, feature velocity, automation, monitoring, reliability, and architecture. We then spoke with clients implementing SRE about their objectives, successes, and setbacks. We also talked with vendors guiding customers’ implementations, including Google itself, to get its take.
What we found is that you can apply most of Google’s advice — with some tweaking. I highly recommend reviewing the detailed analysis in our new report. To sum up the findings:
- Forty-six percent of the principles in the book work out of the box: they’re sound advice for any IT organization. These include creating SLOs (service level objectives) that augment SLAs (service level agreements), implementing error budgets, and monitoring the four “golden signals” (latency, traffic, errors, and saturation). Do these today. Your customers will thank you.
- Fifty percent of the principles are good advice, but you’ll need to tweak them for your enterprise. These include balancing tickets between operations and development, writing your own APIs to automate processes, and bringing down production systems to test resiliency. None of this is bad advice per se, but your mileage may vary if you don’t adapt it to your enterprise.
- A small number of principles (4%) are ones you should not execute. These mostly have to do with load balancing. Google’s approach is not invalid, but it addresses geographical architecture challenges that your enterprise probably does not have.
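For readers new to error budgets, the arithmetic behind the first group of principles is simple: the budget is whatever the SLO leaves over, that is, one minus the SLO target, applied to the traffic in the measurement window. The sketch below is illustrative only. The function names, the 99.9% target, and the request counts are hypothetical and come from neither our report nor Google's book.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Return how many requests may fail in the window without breaching the SLO.

    slo_target is a fraction, e.g. 0.999 for a 99.9% availability SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of traffic the SLO does NOT promise to serve.
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("window too small to carry any error budget")
    return 1.0 - failed_requests / budget


# Hypothetical window: 1,000,000 requests against a 99.9% SLO, 250 failures so far.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"allowed failures: {allowed}, budget remaining: {remaining:.0%}")
```

A team might use a number like this to decide when to slow feature releases and spend time on reliability instead. Where to set that threshold is a policy choice for your own organization.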
In the end, we recommend applying most of the concepts with some tweaking. Focus on the service delivery, feature velocity, and automation concepts in the book. Focus less on the architecture sections, as Google’s challenges likely don’t mirror your own.