Troubleshooting Workloads on GKE for Site Reliability Engineers
Google Kubernetes Engine (GKE)
GKE is the industry’s first fully managed Kubernetes service. It implements the full Kubernetes API and offers 4-way autoscaling, release channels, and multi-cluster support.
Site Reliability Engineers (SRE)
SREs have a broad set of responsibilities, and managing incidents is a critical part of their role.
Troubleshooting is an iterative process: SREs form a hypothesis about the potential root cause of an incident, then filter, search, and navigate through large volumes of telemetry data collected from their systems to validate or invalidate that hypothesis. If a hypothesis proves invalid, they form another one and repeat the cycle until they can isolate a root cause. To learn more about SREs, see Google Site Reliability Engineering on the Google website.
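The hypothesis loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hypothesis` class, the `isolate_root_cause` function, and the sample log records are all hypothetical, and the code works on an in-memory list rather than calling any Google Cloud API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    """A candidate root cause plus a filter that selects supporting telemetry.

    Hypothetical helper for illustration; not part of any Google Cloud library.
    """
    name: str
    matches: Callable[[dict], bool]
    min_evidence: int = 1  # how many matching records count as validation


def isolate_root_cause(hypotheses, telemetry):
    """Test each hypothesis in turn and return the first one the data supports."""
    for hypothesis in hypotheses:
        # Filter the telemetry down to the records relevant to this hypothesis.
        evidence = [entry for entry in telemetry if hypothesis.matches(entry)]
        if len(evidence) >= hypothesis.min_evidence:
            return hypothesis.name, evidence  # hypothesis validated
        # Otherwise the hypothesis is invalidated; move on to the next one.
    return None, []


# Made-up log records standing in for real telemetry.
telemetry = [
    {"severity": "INFO", "pod": "web-1", "message": "request served"},
    {"severity": "ERROR", "pod": "web-2", "message": "OOMKilled"},
    {"severity": "ERROR", "pod": "web-2", "message": "OOMKilled"},
    {"severity": "WARNING", "pod": "web-1", "message": "slow response"},
]

hypotheses = [
    Hypothesis("image pull failure",
               lambda e: "ImagePullBackOff" in e["message"]),
    Hypothesis("container out of memory",
               lambda e: "OOMKilled" in e["message"], min_evidence=2),
]

cause, evidence = isolate_root_cause(hypotheses, telemetry)
print(cause)  # container out of memory
```

In a real incident the list of hypotheses is rarely known up front; each invalidated hypothesis usually informs the next one, so the loop is driven by the engineer rather than by a fixed list.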
In the video below, you will learn how to navigate that iterative journey efficiently and effectively using Google Cloud’s operations tools!
Video Link: https://www.youtube.com/watch?v=Y70KGRb5Lls