Project author: upgundecha

Project description:
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
Primary language: JavaScript
Project URL: git://github.com/upgundecha/howtheysre.git
Created: 2021-02-14T14:03:02Z
Project community: https://github.com/upgundecha/howtheysre

License: Creative Commons Zero v1.0 Universal


How they SRE


Introduction

How They SRE is a curated knowledge repository of Site Reliability Engineering (SRE) best practices, tools, techniques, and culture adopted by leading technology or tech-savvy organizations.

Many organizations share the best practices, tools, and techniques that shape their engineering culture through public platforms such as engineering blogs, conferences, and meetups. This repository compiles and presents content gathered from these sources.

Topics

  • Site Reliability Engineering
  • Hiring and Building SRE teams
  • SRE Culture
  • DevOps
  • Monitoring & Observability
  • Alerting
  • Incident Response & Post-Mortem
  • On-Call
  • Testing in Production
  • Chaos Engineering
  • Automation
  • Performance
  • Platform Engineering

Organizations


Achievers

### Blog Posts

* Enter the Abattoir - Building ‘à la carte’ gitops tooling
* Scaling Production Globally — The service mesh facelift (Part-1)
* Scaling Production Globally - Solving observability problems for developers (Part-2)
* Load Testing Kubernetes: Building a Framework (Part-1)
* Load Testing Kubernetes: Resolving bottlenecks and improving performance (Part-2)


Airbnb

### Blog Posts

* Automated Incident Management Through Slack
* Detecting Vulnerabilities With Vulnture
* Alerting Framework at Airbnb
* When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb
* Intelligent Automation Platform: Empowering Conversational AI and Beyond at Airbnb
* Production Secret Management at Airbnb
* Automating Data Protection at Scale, Part 1
* Automating Data Protection at Scale, Part 2
* Automating Data Protection at Scale, Part 3
* Dynamic Kubernetes Cluster Scaling at Airbnb


Algolia

### Blog Posts

* May 30 SSL incident
* A Journey Into SRE
* CI/CDay 2024: What makes a good CI/CD platform?


Alibaba Cloud

### Blog Posts

* Why Are the Top Internet Companies Choosing SRE over Traditional O&M?
* Architecture and Practices of Bilibili’s Real-time Platform


Asana

### Blog Posts

* How Asana uses Asana: Security incident response
* How Asana ships stable web application releases
* Analysis of recent downtime & what we’re doing to prevent future incidents
* Developer environment: Achieving reliability by making it fast to reset
* Three security tactics for every IT leader to consider this fall


ASOS

### Blog Posts

* Playing the blame-less game
* A day in the life of… Cat S (Head of Reliability Engineering)
* An AKS Performance Journey: Part 1 — Sizing Everything Up
* An AKS Performance Journey: Part 2 — Networking It Out
* Cyber Security @ ASOS.com
* Security Operations 24x7
* The skills we look for in Cyber Security Incident Response


Atlassian

### Blog Posts

* Best practices for change management in the age of DevOps
* Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code
* How to export Kubernetes events for observability and alerting
* Incident Postmortem Template


BackMarket

### Blog Posts

* How Back Market SREs prepared for Black Friday


Baidu

### Videos

* Anomaly Detection on Golden Signals
* NetRadar: Monitoring the Datacenter Network
* Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity


Basecamp

### Blog Posts

* Inside a CODE RED: Network Edition
* Three Basecamp outages. One week. What happened?
* Basecamp 2 and Basecamp 3 search outage report
* Reducing Incident Escalations at Basecamp

### Books

* Shape Up


Bloomberg

### Videos

* Capacity Planning and Performance Enhancement with Page Reference Sampling
* Why SREs can’t afford to NOT do Chaos Engineering
* Tracing Real-Time Distributed Systems
* The Bloomberg Story: Building SRE Teams in an “Immeasurable” Organisation
* Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest


Booking.com

### Blog Posts

* How Reliability and Product Teams Collaborate at Booking.com
* Incidents, fixes, and the day after
* Troubleshooting: A journey into the unknown

### Videos

* Sailing the Database Seas: Applying SRE Principles at Scale
* SLOs for Data-Intensive Services
* Benefits of Taking the Less Traveled Road with Containers Infrastructure


Capital One

### Blog Posts

* Automate Application Monitoring with Slack
* Automate AWS Infrastructure with Boto 3: AWS Health Check
* Active-Active Shared-Nothing Database Architecture
* The 3 R’s of SREs: Resiliency, Recovery & Reliability
* 5 Steps to Getting Your App Chaos Ready
* 4 Real-World Scenarios That Read Like Chaos Engineering Experiments
* Embrace the Chaos … Engineering
* 3 Lessons Learned From Implementing Chaos Engineering at Enterprise
* A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy
* Secure Docker Containers Require Secure Applications
* 4 Steps for Pairing the Cloud and DevOps to Improve Resiliency
* Container Ready Applications with Twelve-Factor App and Microservices Architecture
* Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS
* Architecting for Resiliency
* Continuous Chaos — Introducing Chaos Engineering into DevOps Practices
* The Mon-ifesto Part 1: Metrics

### Major incidents & analysis reports

* Information on the Capital One Cyber Incident
* A Case Study of the Capital One Data Breach

### Videos

* Banking on Continuous Delivery - Capital One
* Continuous Chaos in DevOps - Capital One
* DevOps at Capital One: Focusing on Pipeline and Measurement
* Automating the Management of the Operational Health of Cloud Accounts at Scale


Coinbase

### Blog Posts

* Open Sourcing Coinbase’s Secure Deployment Pipeline


DAZN

### Blog Posts

* Site Reliability at DAZN


DBS

### Blog Posts

* Presenting at iThome’s SRE Conference: Our DBS SRE Transformation Journey Thus Far
* Debunking the seven most popular Site Reliability Engineering myths
* How To Use SRE To Cultivate A Blameless Culture In The Workplace
* Site Reliability Engineering at DBS Bank
* Automating Configuration Management at Scale
* How DBS dispelled the myths of Chaos Engineering
* Double, Double Toil and Trouble

### Videos

* SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS


DeepSource

### Blog Posts

* Redis diskless replication: What, how, why and the caveats
* How to setup Vault with Kubernetes
* Breaking down zero downtime deployments in Kubernetes


Dream11

### Blog Posts

* Deployment At Scale: Story Behind Dream11’s In-House Blue-Green Deployment Platform ‘OneClick’.
* Enhancing security and trust with AWS WAFv2
* Lessons learned from running GraphQL at scale
* Break circuits, save Kong 🦍
* Finding Order in Chaos: How We Automated Performance Testing with Torque
* Maintaining hyper-sonic releases at Dream11
* To Scale In Or Scale Out? Here’s How We Scale at Dream11
* Building Scalable Real Time Analytics, Alerting and Anomaly Detection Architecture at Dream11


Dropbox

### Blog Posts

* Dropbox Engineering Career Framework - Reliability Engineer (SRE)
* Atlas: Our journey from a Python monolith to a managed platform
* Monitoring server applications with Vortex
* Athena: Our automated build health management system
* Interested in becoming a Site Reliability Engineer?

### Videos

* Service Discovery Challenges at Scale


eBay

### Blog Posts

* Resiliency and Disaster Recovery with Kafka
* SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue
* SRE Case Study: Mysterious Traffic Imbalance
* Zero Downtime, Instant Deployment and Rollback
* How eBay’s Notification Platform Used Fault Injection in New Ways

### Video

* Madaari: Ordering for the Monkeys


Epic Games

### Video

* AWS re:Invent 2018: Epic Games Uses AWS to Deliver Fortnite to 200 Million Players


Etsy

### Blog Posts

* Improving the Deployment Experience of a Ten-Year Old Application
* How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020
* Your brain on progress
* Etsy’s Debriefing Facilitation Guide for Blameless Postmortems
* Opsweekly: Measuring on-call experience with alert classification
* Demystifying Site Outages
* Blameless PostMortems and a Just Culture
* Measure Anything, Measure Everything

### Videos

* Velocity 09: John Allspaw and Paul Hammond, “10+ Deploys Pe
* Migrating a Monolith to the Cloud


Expedia

### Blog Posts

* Automating Performance Standards
* Error Budget Policy - Part 1 - Adoption at Expedia Group
* Error Budget Policy - Part 2 - Practices at Expedia Group
* Using Fault-Injection to Improve our new Runtime Platform’s Reliability
* Learning from Incidents at Expedia Group
* Improving Vrbo Homepage Loading Experience
* Troubleshooting 502 errors: ECS Checklist
* Getting Started with Elasticsearch
* All about ISTIO-PROXY 5xx Issues
* Autoscaling in Kubernetes: Why doesn’t the Horizontal Pod Autoscaler work for me?
* How to Keep Your Kubernetes Deployments Balanced Across Multiple zones
* Are Your Dropwizard Latency Metrics Misleading You?
* The Cost of 100% Reliability
* Creating Monitoring Dashboards
* Using Bash for DevOps


Fastly

### Videos

* SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager
* Resilience Engineering Mythbusting


G-Research

### Blog Posts

* Our SRE Journey at G-Research
* The SRE Journey Continues
* OpenTSDB Meta Cache – trade-offs for performance


Getaround

### Blog Posts

* How we handle incidents at Getaround
* Evolution Of Our Continuous Delivery Process


GitHub

### Blog Posts

* How we improved availability through iterative simplification
* How we improved push processing on GitHub
* How GitHub uses merge queue to ship hundreds of changes every day
* Fixing security vulnerabilities with AI
* GitHub’s Engineering Fundamentals program: How we deliver on availability, security, and accessibility
* How GitHub uses GitHub Actions and Actions larger runners to build and test GitHub.com
* The GitHub Security Lab’s journey to disclosing 500 CVEs in open source projects
* CodeQL team uses AI to power vulnerability detection in code
* Addressing GitHub’s recent availability issues
* Building organization-wide governance and re-use for CI/CD and automation with GitHub Actions
* Enabling branch deployments through IssueOps with GitHub Actions
* Using ChatOps to help Actions on-call engineers
* Partitioning GitHub’s relational databases to handle scale
* Increasing developer happiness with GitHub code scanning
* Why (and how) GitHub is adopting OpenTelemetry
* Improving large monorepo performance on GitHub
* Deployment reliability at GitHub
* Improving how we deploy GitHub
* Building On-Call Culture at GitHub
* Reducing flaky builds by 18x
* The evolving role of operations in DevOps
* Getting started with DevOps automation
* MySQL High Availability at GitHub

### Major incidents & analysis reports

* GitHub Availability Report: August 2024
* GitHub Availability Report: July 2024
* GitHub Availability Report: June 2024
* GitHub Availability Report: May 2024
* GitHub Availability Report: April 2024
* GitHub Availability Report: March 2024
* GitHub Availability Report: February 2024
* GitHub Availability Report: January 2024
* GitHub Availability Report: December 2023
* GitHub Availability Report: November 2023
* GitHub Availability Report: October 2023
* GitHub Availability Report: September 2023
* GitHub Availability Report: August 2023
* GitHub Availability Report: July 2023
* GitHub Availability Report: June 2023
* GitHub Availability Report: May 2023
* GitHub Availability Report: April 2023
* GitHub Availability Report: March 2023
* GitHub Availability Report: February 2023
* GitHub Availability Report: January 2023
* GitHub Availability Report: December 2022
* GitHub Availability Report: November 2022
* GitHub Availability Report: October 2022
* GitHub Availability Report: September 2022
* GitHub Availability Report: August 2022
* GitHub Availability Report: July 2022
* GitHub Availability Report: June 2022
* GitHub Availability Report: May 2022
* GitHub Availability Report: April 2022
* GitHub Availability Report: March 2022
* GitHub Availability Report: February 2022
* GitHub Availability Report: January 2022
* GitHub Availability Report: December 2021
* GitHub Availability Report: November 2021
* GitHub Availability Report: October 2021
* GitHub Availability Report: September 2021
* GitHub Availability Report: August 2021
* GitHub Availability Report: July 2021
* GitHub Availability Report: June 2021
* GitHub Availability Report: May 2021
* GitHub Availability Report: April 2021
* GitHub Availability Report: March 2021
* GitHub Availability Report: February 2021
* GitHub Availability Report: January 2021
* GitHub Availability Report: December 2020
* GitHub Availability Report: November 2020
* GitHub Availability Report: August 2020
* GitHub Availability Report: July 2020
* Introducing the GitHub Availability Report
* February service disruptions post-incident analysis
* October 21 post-incident analysis
* February 28th DDoS Incident Report
* Incident Report: Inadvertent Private Repository Disclosure

### Videos

* One on One SRE


GitLab

### Blog Posts

* This SRE attempted to roll out an HAProxy config change. You won’t believe what happened next…
* My week shadowing a GitLab Site Reliability Engineer
* Update: Elasticsearch lessons learnt for Advanced Global Search
* Lessons in iteration from a new team in infrastructure
* How we optimized infrastructure spend at GitLab
* How we scaled async workload processing at GitLab.com using Sidekiq
* Inside GitLab: How we release software patches
* What tracking down missing TCP Keepalives taught me about Docker, Golang, and GitLab
* How we used delayed replication for disaster recovery with PostgreSQL


GoCardless

### Blog Posts

* Deploying Software at GoCardless: Open-Sourcing our “Getting Started” Tutorial
* How we compress Pub/Sub messages and more, saving a load of money
* Fear-free PostgreSQL migrations for Rails
* Observability at GoCardless: a tale of API performance improvement
* Debugging the PostgreSQL query planner
* Zero-downtime Postgres migrations - the hard parts
* In search of performance - how we shaved 200ms off every POST request

### Major incidents & analysis reports

* Incident review: Service outage on 25 October 2020, Vault TLS expiry
* Incident review: API and Dashboard outage on 10 October 2017


GoDaddy

### Blog Posts

* Kubernetes Gated Deployments
* Kubernetes External Secrets
* Kubernetes - A Practical Introduction for Application Developers
* An Intuitive Node.js Client for the Kubernetes API


Gojek

### Blog Posts

* Introducing Skynet: Infrastructure as Code for Gojek
* Scaling Our Geo-Search Service For 10x Load
* Why We Swear by the RCA
* How We Upgrade Kubernetes on GKE
* How We Monitor Apache Airflow in Production


Goldman Sachs

### Blog Posts

* SecDb Observability Journey
* Chaos Testing an Application on AWS
* Forecasting Capacity Outages Using Machine Learning to Bolster Application Resiliency
* Providing 99.9% Availability and Sub-Second Response Times with Sybase IQ Multiplexes by Using HAProxy
* Building Multi-Region Resiliency with Amazon RDS and Amazon Aurora
* Enabling Highly Available Trino Clusters at Goldman Sachs
* Observability at Scale
* Infrastructure and the Command Chain Pattern
* Mobile CICD with EC2 macOS
* Announcing CatchIT - Source Code Secret Scanner
* Building Platforms for Data Engineering

### Videos

* Granular CPU Capacity Management at Scale with eBPF


Google

### Blog Posts

* Accelerating incident response using generative AI
* Pitfalls and Patterns in Microservice Dependency Management
* SRE Practices & Processes
* Google site reliability using Go
* Three months, 30x demand: How we scaled Google Meet during COVID-19
* SRE Classroom: Distributed PubSub
* How SRE teams are organized, and how to get started

### Videos

* Get Your Non-SREs Oncall Ready!
* Reliable Data for Large ML Models: Principles and Practices
* New Grads Becoming New SREs: Catalyzing a “Circle of Life” in Ireland
* SRE for [cyber]security
* Artificial Intelligence: How Much Will It Cost You?
* What’s the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google
* Risk and Error Budgets’ with Seth Vargo and Liz Fong-Jones of Google
* Pragmatic Automation’ with Max Luebbe of GCP
* Must Watch! - Google SRE YouTube Playlist
* Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit
* Implementing Distributed Consensus
* The SRE I Aspire to Be
* SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours
* Zero Touch Prod: Towards Safer and More Secure Production Environments
* All of Our ML Ideas Are Bad (and We Should Feel Bad)
* The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It
* Deploying SRE Training Best Practices to Production: How We SRE’ed Our SRE Education Program
* Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way
* Practical Instrumentation for Observability
* What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services
* Unified Reporting of Service Reliability
* How to Trade off Server Utilization and Tail Latency
* Keeping the Balance: Internet-Scale Loadbalancing Demystified
* From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services
* Mindfulness in SRE: Monitoring and Alerting for One’s Self
* Pragmatic Automation
* Sublinear Scaling in Practice: The 1k SRE Project
* Strategies to Edit Production Data
* The Curse of SRE Autonomy and How to Manage It
* Scaling SRE Organizations: The Journey from 1 to Many Teams
* SRE Classroom - How to Design a Distributed System in 3 Hours
* Using PRDs and User Journeys to Design User-Friendly Tools
* How Google SRE and Developers Work Together
* SREcon21 - Experiments for SRE


Grab

### Blog Posts

* Our Journey to Continuous Delivery at Grab (Part 1)
* Our Journey to Continuous Delivery at Grab (Part 2)
* Designing Resilient Systems: Circuit Breakers or Retries? (Part 1)
* Designing Resilient Systems: Circuit Breakers or Retries? (Part 2)
* Designing Resilient Systems Beyond Retries (Part 3): Architecture Patterns and Chaos Engineering
* Orchestrating Chaos using Grab’s Experimentation Platform
* How We Designed the Quotas Microservice to Prevent Resource Abuse
* How We Scaled Our Cache and Got a Good Night’s Sleep


Grammarly

### Blog Posts

* Scaling AWS Infrastructure to Support Multiple Regions
* Security Operations in an AWS Environment


Gusto

### Blog Posts

* Service Level Objectives for On-call Peace of Mind
* Debugging Sidekiq Poison Pills


Halodoc

### Blog Posts

* Site Reliability Engineering for Native mobile apps


Heroku

### Blog Posts

* The Adventures of Rendezvous in Heroku’s New Architecture
* Incident Response at Heroku


IBM

### Blog Posts

* What is Site Reliability Engineering (SRE)?
* AIOps tools and solutions


Indeed

### Blog Posts

* Indeed SRE: An Inside Look
* Being Just Reliable Enough
* Automating Indeed’s Release Process
* Sloth, a Tool for Inducing Network Failures’ with Preetha Appan of Indeed.com

### Videos

* Are We Getting Better Yet? Progress Toward Safer Operations


Indeed

### Blog Posts

* SRE Playbook - Practical Guide


Khan Academy

### Blog Posts

* How Khan Academy Successfully Handled 2.5x Traffic in a Week
* Evolving our content infrastructure


LinkedIn

### Blog Posts

* Rethinking site capacity projections with Capacity Analyzer
* Insights into a Product SRE team at LinkedIn
* Hiring SREs at LinkedIn
* Open source update: School of SRE
* Fixing Linux filesystem performance regressions
* Production testing with dark canaries
* Smart alerts in ThirdEye, LinkedIn’s real-time monitoring platform
* Iris mobile: An open source, mobile interface for incident management
* LinkedOut: A Request-Level Failure Injection Framework
* Eliminating toil with fully automated load testing
* The Makeup of Successful Geographically-Distributed SRE Teams: Part 1
* The Makeup of Successful Geographically-Distributed SRE Teams: Part 2
* [Project STAR: Streamlining Our On-Call Process](https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process)
* Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
* Resilience Engineering at LinkedIn with Project Waterbear
* Hiring SREs at LinkedIn, 2017
* Open Sourcing Iris and Oncall
* Building the SRE Culture at LinkedIn
* Failure is Not an Option
* MTTD and MTTR Are Key
* What Gets Measured Gets Fixed

### Videos

* Growing the Site Reliability Team at LinkedIn: Hiring is Hard — Greg Leffler
* 9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE
* Weathering the Storm: How Early Warnings Save the Farm
* Unconference: Unsolved Problems in SRE
* Leading without Managing: Becoming an SRE Technical Leader
* Why Does (My) Monitoring Suck?
* Traffic Forecasting and Stress Testing Infrastructure
* Collective Mindfulness for Better Decisions in SRE
* TCP—Architecture, Enhancements, and Tuning
* Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up
* Understanding Business Metrics Can Make You a Better SRE
* Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way
* Differences in SRE Implementations across Companies

### Tools

* On-Call


Loggi

### Blog Posts

* The Release Manager model
* SRE Teams #8: Loggi


Loveholidays

### Blog Posts

* Dynamic alert routing with Prometheus and Alertmanager
* Making loveholidays 18% faster with HTTP/3
* Enforcing best practice on self-serve infrastructure with Terraform, Atlantis and Policy As Code
* The 5 principles that helped scale loveholidays
* Realtime Fastly logs with Grafana Loki for under $1 a day


Macquarie

### Blog Posts

* Our DevSecOps journey with Golang
* Pipeline Configuration as Code with Kotlin
* DevOps and Segregation of Duties
* Macquarie embraces DevOps
* Scaling a Kubernetes Platform across the Enterprise


Mattermost

### Blog Posts

* Monitoring Cloud Environments at Scale with Prometheus and Thanos
* How We Use Sloth to do SLO Monitoring and Alerting with Prometheus


Meituan (美团)

### Blog Posts

* The development and practice of SRE in the cloud (云端的SRE发展与实践)


Mercari

### Blog Posts

* Who Watches the Watchmen? Keeping an Eye on Our Monitoring Systems
* What the Microservices SRE Team are doing as SRE Evangelists
* What it’s like to work as an embedded microservices SRE
* The Merpay SRE Team: Past and future
* Embedded SRE at Mercari
* What the SRE team wants to achieve with the development team
* DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?
* How do we share troubleshooting skills
* Datadog Dashboard at Scale w / Terraform


Meta

### Blog Posts

* Leveraging AI for efficient incident response
* Improving Meta’s SLO workflows with data annotations
* SLICK: Adopting SLOs for improved reliability
* More details about the October 4 outage
* Update about the October 4th outage

### Videos

* Scheduling at Scale: eBPF Schedulers with Sched_ext
* A Customer Service Approach to SRE
* How (Not) to Scale a Project: A Post-Mortem
* Releasing the World’s Largest Python Site Every 7 Minutes
* Using ML to Automate Dynamic Error Categorization


Microsoft

### Videos

* SLI & Reliability Deep-Dive’ with David N. Blank-Edelman of Microsoft
* Ironies of Automation: A Comedy in Three Parts’ with Tanner Lund of Microsoft
* Sustainable Software Engineering & SREs
* Study on Human Factors and Team Culture to Improve Pager Fatigue
* Prioritizing Trust While Creating Applications
* Building Resilience: How to Learn More from Incidents
* A Tale of Two Postmortems: A Human Factors View
* Availability—Thinking beyond 9s
* Ironies of Automation: A Comedy in Three Parts
* The Ops in Serverless


MIRO

### Blog Posts

* Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics
* Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture
* Reliable load testing with regards to unexpected nuances


Monzo

### Blog Posts

* Autoscaling Monzo: How we optimise our platform to be just the right size
* How we’ve evolved on-call at Monzo
* How we respond to incidents
* How we monitor Monzo

### Videos

* Eventually Consistent Service Discovery

### Tools

* Response


Netflix

### Blog Posts

* Achieving observability in async workflows
* Building Netflix’s Distributed Tracing Infrastructure
* Lessons from Building Observability Tools at Netflix
* Edgar: Solving Mysteries Faster with Observability
* Telltale: Netflix Application Monitoring Simplified
* Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix
* Introducing Dispatch
* Applying Netflix DevOps Patterns to Windows
* ChAP: Chaos Automation Platform
* Starting the Avalanche
* Netflix Chaos Monkey Upgraded
* Chaos Engineering Upgraded
* Automated Failure Testing
* From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform
* Introducing Atlas: Netflix’s Primary Telemetry Platform
* FIT: Failure Injection Testing
* Announcing Security Monkey — AWS Security Configuration Monitoring and Analysis
* Lessons Netflix Learned from the AWS Outage
* Scryer: Netflix’s Predictive Auto Scaling Engine

### Major incidents & analysis reports

* Post-mortem of October 22, 2012 AWS degradation

### Videos

* Achieving Excellence: SLO Thresholds That Transform Service Quality
* AWS re:Invent 2019: A day in the life of a Netflix engineer (NFX202)
* When /bin/sh Attacks: Revisiting “Automate All the Things”
* How Did Things Go Right? Learning More from Incidents
* Monitoring and Tracing @Netflix Streaming Data Infrastructure
* Real user performance monitoring at Netflix scale ‐ Martin Spier
* AWS re:Invent 2017 - Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is
* AWS re:Invent 2017: Performing Chaos at Netflix Scale (DEV334)
* Netflix: Multi-Regional Resiliency and Amazon Route 53
* Designing Services for Resilience: Netflix Lessons
* South Bay SRE Meetup - Netflix Cloud Performance Team
* AWS re:Invent 2017: A Day in the Life of a Netflix Engineer III (ARC209)
* How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows
* Mastering Chaos - A Netflix Guide to Microservices
* AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global Architecture (ARC204)
* SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs
* From Sys Admin to Netflix SRE
* Application Resilience Engineering and Operations at Netflix with Hystrix
* Injecting Failure at Netflix
* LISA13 - How Netflix Embraces Failure to Improve Resilience and Maximize Availability
* Incident Management at Netflix Velocity

### Podcasts

* Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems

### Tools

* Dispatch


New Relic

### Blog Posts

* Defining Modern Software Roles: SREs at New Relic
* 10 Things Everybody Needs to Know About Site Reliability Engineering (SRE)
* What Tools Do Site Reliability Engineers Use?
* A Day in the Life of a New Relic SRE
* 7 Habits of Highly Successful Site Reliability Engineers
* Adopting the practice of SRE
* Using modern observability to establish a data-driven culture


Nubank

### Blog Posts

* Engineering operational excellence, a case of continuous improvement
* How we deal with technical incidents
* How we do On-Call Rotations at Nubank
* How we scale our data platform efficiently and reliably
* Why We Killed Our End-to-End Test Suite
* Automatic retraining for machine learning models: tips and lessons learned


OpenAI

### Blog Posts

* March 20 ChatGPT outage: Here’s what happened
* OpenAI SRE and scaling explained easy.
* Scaling Kubernetes to 2,500 nodes
* Scaling Kubernetes to 7,500 nodes
* Scaling AI Infrastructure at OpenAI


PayPal

### Blog Posts

* Triggered: Incident #1234 (incident process needs fixing)
* Implementing Observability in a Service Mesh
* PostgreSQL at Scale: Database Schema Changes Without Downtime
* Scaling GraphQL at PayPal

### Videos

* SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal
* SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal
* Detecting Service Degradation and Failures at Scale through Distributed Log Processing
* Operating Elasticsearch with Ease at Scale
* Ensuring Site Reliability through Security Controls


Picnic

### Blog Posts

* Micrometer and the Modern Observability Stack
* Monitoring and Observability at Picnic


Pinterest

### Blog Posts

* Ensuring High Availability of Ads Realtime Streaming Services
* Improving efficiency and reducing runtime using S3 read optimization
* Scaling Kubernetes with Assurance at Pinterest
* What we learned from an iOS app OOMs incident
* How we designed our Continuous Integration System to be more than 50% Faster
* Simplifying web deploys
* Upgrading Pinterest operational metrics
* Distributed tracing at Pinterest with new open source tools
* Auto scaling Pinterest

### Videos

* Building Actionable Code Ownership
* Evolution of Observability Tools at Pinterest
* Automating OS/Platform Upgrades for Service Owners


Postman

### Blog Posts

* Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana


Prezi

### Blog Posts

* How to avoid global outage — Seamlessly migrating DaemonSet labels
* In search of speed — debugging Elasticsearch performance
* Prometheus at Prezi: replacing 10 years of anti-patterns


Red Hat

### Blog Posts

* From Ops to SRE: Evolution of the OpenShift Dedicated Team
* 5 Agile Practices Every SRE Team Should Adopt
* 7 Best Practices for Writing Kubernetes Operators: An SRE Perspective


Reddit

### Videos

* Noisy Neighbors, through Networking


Riot Games

### Blog Posts

* THE LEGENDS OF RUNETERRA CI/CD PIPELINE
* STRATEGIES FOR WORKING IN UNCERTAIN SYSTEMS
* IMPROVING THE DEVELOPER EXPERIENCE FOR OPERATING SERVICES
* SCALABILITY AND LOAD TESTING FOR VALORANT
* LEVERAGING GOLANG FOR GAME DEVELOPMENT AND OPERATIONS
* CONTROLLED CHAOS WITH FAULT INJECTION TESTING
* DOWN THE RABBIT HOLE OF PERFORMANCE MONITORING
* PROFILING: THE CASE OF THE MISSING MILLISECONDS
* PROFILING: REAL WORLD PERFORMANCE IN LEAGUE
* PROFILING: OPTIMISATION
* PROFILING: MEASUREMENT AND ANALYSIS
* RUNNING ONLINE SERVICES AT RIOT: PART I
* RUNNING ONLINE SERVICES AT RIOT: PART II
* RUNNING ONLINE SERVICES AT RIOT: PART III
* RUNNING ONLINE SERVICES AT RIOT: PART III: PART DEUX
* RUNNING ONLINE SERVICES AT RIOT: PART IV
* RUNNING ONLINE SERVICES AT RIOT: PART V
* THE EVOLUTION OF SECURITY AT RIOT
* RUNNING AN AUTOMATED TEST PIPELINE FOR THE LEAGUE CLIENT UPDATE
* AUTOMATED TESTING FOR LEAGUE OF LEGENDS

### Videos

* Riot Games: Evolution of Observability at the Gaming Company


Salesforce

### Blog Posts

* Looking at the Kubernetes Control Plane for Multi-Tenancy
* Optimizing EKS networking for scale
* Zero Downtime Node Patching in a Kubernetes Cluster
* How, Not Why: An Alternative to the Five Whys for Post-Mortems
* A Generic Sidecar Injector for Kubernetes
* Implementation of a monitoring strategy for products based on microservices
* 10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use
* Our Journey to a Near Perfect Log Pipeline
* Optimizing Performance with Web Workers
* Take A Moment To Refocus


Schibsted Media

### Blog Posts

* Reliability engineering for some of top 10 sites in Scandinavia


Scribd

### Blog Posts

* Learning from incidents: getting Sidekiq ready to serve a billion jobs
* A testimonial for using PagerDuty at Scribd
* Assigning pager duty to developers


Shopify

### Blog Posts

* Resiliency Planning for High-Traffic Events
* Capacity Planning at Scale
* Using DNS Traffic Management to Add Resiliency to Shopify’s Services
* Four Steps to Creating Effective Game Day Tests
* Implementing ChatOps into our Incident Management Procedure
* StatsD at Shopify

### Videos

* Enhancing Elasticsearch Performance: Innovative Reindexing Strategies Using Dedicated Nodes and KEDA Autoscalers
* Network Monitor: A Tale of ACKnowledging an Observability Gap
* Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures
* Advanced Napkin Math: Estimating System Performance from First Principles


Sky Betting and Gaming

### Blog Posts

* It’s Just a Monitoring Change
* “What’s the worst that could happen?”: A worked example of how we deal with live incidents
* Rising from the Ashes
* Crash! Bang! Wallop! Practice makes perfect
* Performance Left Right and Center


Slack

### Blog Posts

* Slack’s Incident on 2-22-22
* Infrastructure Observability for Changing the Spend Curve
* Slack’s Outage on January 4th 2021
* A Terrible, Horrible, No-Good, Very Bad Day at Slack
* Deploys at Slack
* Disasterpiece Theater: Slack’s process for approachable Chaos Engineering

### Videos

* Scaling Chef Emotionally
* Slack at the Edge
* What Breaks Our Systems: A Taxonomy of Black Swans


Slalom Build

### Blog Posts

* How to Implement Service Level Objectives in New Relic APM
* Beginners Guide to DevOps: How to Make It into the Industry
* GitHub Actions: Beyond CI/CD
* Why isn’t all test automation run on the pipeline?
* The Many Shapes of Site Reliability Engineering
* How to build a secure by default Kubernetes cluster with a basic CI/CD pipeline on AWS
* Secret Management Architectures: Finding the balance between security and complexity
* Detecting Malicious Requests with Keras & Tensorflow
* The Lego Monolith — A Monolith Microservice Proof of Concept
* Managing Secrets Using Hashicorp Vault
* Packaging Spring Boot Applications for Deployment on Kubernetes
* Immutable Infrastructure and Continuous Delivery in the Cloud


SoundCloud

### Blog Posts

* How to Successfully Hand Over Systems
* Building a Healthy On-Call Culture
* Alerting on SLOs like Pros
* Hands-Off Deployment with Canary
* Prometheus has come of age – a reflection on the development of an open-source project
* Prometheus: Monitoring at SoundCloud
* What I Learned in One Year as an SRE Trainee
* Tests Under the Magnifying Lens


Spotify

### Blog Posts

* Matt Clarke: Senior Backend Infrastructure Engineer
* Designing a Better Kubernetes Experience for Developers
* Techbytes: What The Industry Misses About Incidents and What You Can Do
* Automated Incident Response Infrastructure in GCP

### Videos

* Tracing, Fast and Slow: Digging into and Improving Your Web Service’s Performance


Squarespace

### Blog Posts

* Under the Hood: Ensuring Site Reliability

### Videos

* Pushing through Friction
* How to SRE When Everything’s Already on Fire
* Case Study: Implementing SLOs for a New Service
* Creating a Code Review Culture


Stack Overflow

### Blog Posts

* “This should never happen. If it does, call the developers.”
* Infrastructure as code: Create and configure infrastructure elements in seconds
* Fulfilling the promise of CI/CD
* A deeper dive into our May 2019 security incident
* Guest Post - Failing over without falling over
* How We Built Our Blog
* Stack Overflow Frees Up Engineering Time with Netlify

### Videos

* Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline


Strava

### Blog Posts

* Scaling Club Leaderboard Infrastructure for Millions of Users
* Distributed Tracing at Strava


Stripe

### Blog Posts

* Fast and flexible observability with canonical log lines
* Fast builds, secure builds. Choose two.
* Introducing Veneur: high performance and global aggregation for Datadog

### Videos

* How Stripe Invests in Technical Infrastructure
* The AWS Billing Machine and Optimizing Cloud Costs


Target

### Blog Posts

* Ɔhaos Ǝnginǝǝring @ Target - Part 2
* Ɔhaos Ǝnginǝǝring @ Target - Part 1
* GoAlert - Your Future Open Source, On-Call Notification Product


Teads

### Blog Posts

* Scaling your on-duty team


Tinder

### Blog Posts

* The Ultimate Load Test
* How We Improved Our Performance Using ElasticSearch Plugins: Part 1
* How We Improved Our Performance Using ElasticSearch Plugins: Part 2
* Tinder’s move to Kubernetes


Tokopedia

### Blog Posts

* Benefits of benchmarking with Go
* Simulating Customized Chaos in Golang using Toxiproxy
* How Tokopedia Rank Millions of Products in Search Page


Trivago

### Blog Posts

* How To Get Fooled By Metrics


Twilio

### Blog Posts

* Twilio SRE Gameday Template


Twitter

### Blog Posts

* Logging at Twitter: Updated
* Deleting data distributed throughout your microservices architecture
* Deterministic Aperture: A distributed, load balancing algorithm
* MetricsDB: TimeSeries Database for storing metrics at Twitter
* The Infrastructure Behind Twitter: Scale
* The infrastructure behind Twitter: efficiency and optimization


Uber

### Blog Posts

* Founding Uber SRE
* Disaster Recovery for Multi-Region Kafka at Uber
* Engineering Failover Handling in Uber’s Mobile Networking Infrastructure
* Optimizing Observability with Jaeger, M3, and XYS at Uber

### Videos

* A Tale of Two Rotations: Building a Humane & Effective On-Call
* Testing in Production at Scale
* ‘A History of SRE at Uber’ with Rick Boone of Uber


Udemy

### Blog Posts

* Blameless Incident Reviews at Udemy
* How Udemy does Build Engineering

### Videos

* Monitoring Systems as a Service – Walking the Line between Giving Your Devs Good M&O and Setting All Your Money on Fire
* Udemy - How to Do SRE When You Have No SRE


upGrad

### Blog Posts

* Web Performance and Related Stories — upgrad.com
* Beginner’s guide to web analytics
* iOS Continuous Deployment with Bitbucket, Jenkins and Fastlane at UpGrad


VGW

### Blog Posts

* The SRE Incident Response game

### Videos

* Level Up Your Incident Response With Gameplay


Wikimedia Foundation

### Videos

* Testing Encyclopedias in Production
* What Happens When You Type en.wikipedia.org?


Wix

### Blog Posts

* How We Improved Website Performance by Evolving Our Infrastructure
* Wix Inbox Journey: 3 Approaches for Zero Downtime Database Migration
* Moving Velo to Multiple Container Sites: The Why, The How and The Lessons Learned
* Making Order in CI/CD Mess


Yelp

### Blog Posts

* The process: Implementing Yelp’s failover strategy

### Videos

* Yelp - What I Wish I Knew before Going On-Call


Zalando

### Blog Posts

* Tracing SRE’s journey in Zalando - Part I
* Tracing SRE’s journey in Zalando - Part II
* Tracing SRE’s journey in Zalando - Part III

### Videos

* The Frontiers of Reliability Engineering
* Service Level Objectives


Zerodha

### Blog Posts

* Infrastructure monitoring with Prometheus at Zerodha
* Logging at Zerodha


Zomato

### Blog Posts

* Huddle Diaries – DevOps and Data Platform

SRECon Mix Playlist

Videos


Resources

📚 Books

Events

Other Resources

Awesome Lists

SRE Resources from various organizations

Incidents & postmortems

Newsletters

Credits

Other How They… repos

Contributors



Contribute

Contributions welcome! Read the contribution guidelines first.

Stargazers Over Time

License

CC0

To the extent possible under law, Unmesh Gundecha has waived all copyright and
related or neighboring rights to this work.


If you decide to use this anywhere, please credit @upgundecha on X. Also, if you like my work, check out my other projects on GitHub.