Research: Triage workflow to minimal

What’s this issue all about?

A triage workflow in Monitor is the process of detecting and identifying application performance bottlenecks, with the goal of understanding the root cause of a problem quickly and accurately.
In this research, I would like to understand the common workflow for Kubernetes-based applications so we can mature our triage flow to minimal.

Who is the target user of the feature?

DevOps engineers, SREs, and monitoring teams that monitor Kubernetes-based applications.

What questions are you trying to answer?

How do you start a triage flow?
What types of dashboards do you look at when searching for a root cause?

Core questions

When monitoring an application:

What do you alert on?
What types of alerts wake you up at night?
Do you alert on pod-level, container-level, or application-level metrics?
What is the first thing you do when you receive an alert?
Which dashboard do you look at first?
Which metrics do you look at, and at which level (pod, container, application)?
Do you navigate from metrics to logs and traces? If so, how?
Which logs do you look at (pod, container, or application)?
Is there anything missing from your current tool that you need to conduct this flow?

Additional questions

What is your role in the organization?
What do you use GitLab for?
What would you say your main responsibilities are?
What tools do you use to monitor a Kubernetes-based application?

What hypotheses and/or assumptions do you have?

Triage flow is one of the most critical steps in a monitoring solution
A typical triage flow starts with an alert
Users would like to be alerted on metrics, then navigate to logs, and only then to traces

What decisions will you make based on the research findings?

How to build and mature our triaging workflow

When do you need this research to be completed? (Milestone or date)

12.8/12.9

/cc @ameliabauerly @nadia_sotnikova

Edited Dec 18, 2019 by Sarah Waldner