Research: Triage workflow to minimal
What’s this issue all about?
A triage workflow in Monitor is the process of detecting and identifying application performance bottlenecks, with the goal of understanding the root cause of a problem quickly and accurately.
In this research, I would like to understand the common workflow for Kubernetes-based applications so that we can mature our triage flow to minimal.
Who is the target user of the feature?
DevOps engineers, SREs, and monitoring teams that monitor Kubernetes-based applications.
What questions are you trying to answer?
How do you start a triage flow?
What types of dashboards do you look at when searching for a root cause?
Core questions
When monitoring an application:
- What do you alert on?
- What types of alerts wake you up at night?
- Do you alert on pod-level, container-level, or application-level metrics?
- What is the first thing you do when you receive an alert?
- What dashboard do you look at first?
- What metrics do you look at, and at which level (pod, container, application)?
- Do you navigate from metrics to logs and traces? If so, how?
- What logs do you look at (pod, container, or application)?
- Is anything missing from your current tool that you need to conduct this flow?
Additional questions
- What is your role in the organization?
- What do you use GitLab for?
- What would you say your main responsibilities are?
- What tools do you use to monitor a Kubernetes-based application?
What hypotheses and/or assumptions do you have?
- Triage flow is one of the most critical steps in a monitoring solution
- A typical triage flow starts with an alert
- Users would like to be alerted on metrics, then navigate into logs and only then to traces
What decisions will you make based on the research findings?
How to build and mature our triage workflow
When do you need this research to be completed? (Milestone or date)
12.8/12.9
Edited by Sarah Waldner