Design: Activity information for Kubernetes Agent
Problem to solve
Users need visibility into Agent-related events in the GitLab UI so that they can confirm the Agent is working as expected, or identify problems and get the information they need to troubleshoot them.
Intended users
Personas are described at https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/
- Cameron (Compliance Manager)
- Sidney (Systems Administrator)
- Sam (Security Analyst)
- Allison (Application Ops)
- Priyanka (Platform Engineer)
User experience goal
Anyone who comes to this page should be able to see an activity stream of information related to the Agent and use it to troubleshoot deployment actions. They should also be able to confirm that the Agent is working as expected and that it is currently connected. The activity stream will include GitOps events (e.g. how many files/objects were synchronized with the cluster in each sync), the main events related to the Agent (configuration updated, token created, manifest added/removed), and errors/alerts for failures.
Proposal
Add an Activity tab in the Agent details page listing activity items related to the Agent. Below are event categories and examples.
Events are stored in the DB for 1 week.
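A minimal sketch of the one-week retention rule, assuming events carry a `happened_at` timestamp. The `prune_events` helper and the in-memory list are hypothetical stand-ins; in practice this would more likely be a periodic delete against the events table.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(weeks=1)  # from the proposal: events are kept for 1 week


def prune_events(events, now):
    """Return only the events that are still inside the retention window."""
    cutoff = now - RETENTION
    return [e for e in events if e["happened_at"] >= cutoff]


now = datetime(2021, 3, 15, 12, 0, tzinfo=timezone.utc)
events = [
    {"id": 1, "happened_at": now - timedelta(days=8)},   # older than a week
    {"id": 2, "happened_at": now - timedelta(days=6)},
    {"id": 3, "happened_at": now - timedelta(hours=1)},
]
kept = prune_events(events, now)
```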
Token events
- New token <token name> created by <user> at <time>.
- Token <token name> was revoked by <user> at <time>.
- (not sure we need it) Token <token name> was first used at <time>.
Connection events
- Agent status changed to "connected" at <time>.
  On each call from kas to GitLab where kas passes an agent token, record the time of that interaction in the DB. This time is an attribute of the auth token that was used for the call. When the time is recorded, check all valid tokens for this agent and see if the previous most recent call was more than N minutes/hours ago. If it was, record an "agent connected" event.
- Agent status changed to "not connected" at <time>.
  To do this we'd need a background job that scans the agents' tokens to find the agents that have been inactive for some time. If such an agent is found, record an "agent disconnected" event using the last activity time from the most recent interaction. It's probably not worth the effort, especially considering we (should) have similar functionality on the agent details page: we check all tokens of the agent to see if any of them have been used recently. I.e. we show the current status rather than an event plus the date when it happened, but this is probably good enough.
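The two detection rules above can be sketched as follows. This is an illustration under stated assumptions, not the kas or Rails implementation: `AgentActivity`, its method names, the 10-minute threshold, and the `disconnect_reported` flag are all hypothetical, the DB is replaced by in-memory state, and token revocation is not modelled. It also assumes the very first call from an agent counts as "connected".

```python
from datetime import datetime, timedelta, timezone

# The "N minutes/hours" from the text; the concrete value is a guess.
INACTIVITY_THRESHOLD = timedelta(minutes=10)


class AgentActivity:
    """In-memory stand-in for one agent's token rows and recorded events."""

    def __init__(self):
        self.last_used_at = {}            # token_id -> time of most recent kas call
        self.events = []                  # list of (event_name, happened_at)
        self.disconnect_reported = False  # hypothetical flag, avoids repeated events

    def most_recent_call(self):
        """Most recent call across all tokens of this agent, or None."""
        return max(self.last_used_at.values(), default=None)

    def record_call(self, token_id, now):
        """Called on each kas -> GitLab request that carries an agent token."""
        previous = self.most_recent_call()  # read before updating the token
        self.last_used_at[token_id] = now
        if previous is None or now - previous > INACTIVITY_THRESHOLD:
            self.events.append(("agent connected", now))
            self.disconnect_reported = False

    def scan_for_disconnect(self, now):
        """Body of the background job, run periodically for each agent."""
        previous = self.most_recent_call()
        if previous is None or self.disconnect_reported:
            return
        if now - previous > INACTIVITY_THRESHOLD:
            # Use the last activity time, not the scan time.
            self.events.append(("agent disconnected", previous))
            self.disconnect_reported = True


t0 = datetime(2021, 3, 15, 12, 0, tzinfo=timezone.utc)
agent = AgentActivity()
agent.record_call("token-a", t0)                          # first call: connected
agent.record_call("token-b", t0 + timedelta(minutes=1))   # recent activity: no event
agent.scan_for_disconnect(t0 + timedelta(minutes=5))      # still active: no event
agent.scan_for_disconnect(t0 + timedelta(minutes=30))     # inactive: disconnected
agent.scan_for_disconnect(t0 + timedelta(minutes=40))     # already reported: no event
agent.record_call("token-a", t0 + timedelta(minutes=45))  # connected again
```

Note that the "connected" rule needs no background job because it piggybacks on incoming calls, while the "not connected" rule only fires when something actively scans, which is the extra cost the text refers to.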
Configuration events
- Commit <sha> was received by agent at <time>.
  To do this we need to track if this commit has been seen by an agent already and only record an event when it happens for the first time. No need to do it per-Pod, right? Note that we are not saying "configuration was changed" as that may be interpreted as "when the commit was made", we are more explicitly saying "an updated config reached the agent".
- Error: Invalid configuration in commit <sha>, detected at <time>. Message: some free-form text message like "invalid YAML at line 8 column 3" or "field bla is missing", etc.
  To do this we need to track if we have published an event for this commit for this agent already to avoid duplicate events (they are reported by each agent Pod). If we really want to we can make error messages more structured here rather than free-form text. That way we can understand what it is programmatically, but that would require some effort, which I think is not justified at this stage.
- Warning: Deprecated feature used in commit <sha>, detected at <time>. Message: some_deprecated_feature_name configuration section is deprecated, please migrate to new_feature. See <link> for more info.
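The "record only once per commit per agent" rule in the notes above could look roughly like this. `EventRecorder` and `record_once` are hypothetical names, and the in-memory set stands in for what would presumably be a uniqueness constraint in the DB.

```python
class EventRecorder:
    """Records an event at most once per (agent, commit, event type)."""

    def __init__(self):
        self._seen = set()  # keys of events that were already published
        self.events = []

    def record_once(self, agent_id, commit_sha, event_type, message=""):
        """Store the event unless the same one was already recorded.

        Returns True if a new event was stored, False for a duplicate.
        """
        key = (agent_id, commit_sha, event_type)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.events.append(
            {"agent_id": agent_id, "commit": commit_sha,
             "type": event_type, "message": message}
        )
        return True


recorder = EventRecorder()
# Two Pods of the same agent report the same invalid configuration.
first = recorder.record_once(1, "abc123", "configuration_error",
                             "invalid YAML at line 8 column 3")
second = recorder.record_once(1, "abc123", "configuration_error",
                              "invalid YAML at line 8 column 3")
# A different agent seeing the same commit still gets its own event.
other_agent = recorder.record_once(2, "abc123", "configuration_error",
                                   "invalid YAML at line 8 column 3")
```

Keying on the agent rather than the Pod matches the note that per-Pod tracking is probably unnecessary.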
GitOps events
The feature is called GitOps and manifest projects are just that - projects with manifests. I think it might be better to use feature names for grouping.
- Commit <sha> was received by agent at <time>. Sync started.
  See notes for a similar event in the section above.
- Commit <sha> was successfully synchronized with the cluster at <time>.
- Error: Invalid manifest in commit <sha>, detected at <time>. Message: some free-form text message like "invalid YAML at line 8 column 3" or "field bla is missing", etc.
  See notes for a similar event in the section above.
- Error: Failed to synchronize manifests. Triggered by commit <sha>, happened at <time>. Message: some free-form text message like "API server timeout".
- Warning: Deprecated Kubernetes feature used in commit <sha>, detected at <time>. Message: Object kind extensions/Deployment/v1beta1 will be removed in Kubernetes v1.x.y. Please use apps/Deployment/v1. See <link> for more info.
  This is an interesting idea: to show things that might be of interest to the user that we notice when applying the manifests. For this particular example, it might be better to have a tool that lints the files in the manifest repos as part of CI rather than on each sync. Anyway, there might be something that we'd like to bring to the user's attention, and we can use warning events for that.
Future features
Features we add in the future will quite likely require events on this page too.
Technical details
Tentative proto definition of an event:
syntax = "proto3";
import "google/protobuf/timestamp.proto";
enum EVENT_KIND {
  INFO = 0;
  WARNING = 1;
  ERROR = 2;
  SUCCESS = 3; // enum values must be unique unless allow_alias is set
}
message ConfigurationError {
  string message = 1;
}
message GitOpsSyncError {
  int64 project_id = 1;
  string message = 2;
}
message Event {
  int64 agent_id = 1;
  google.protobuf.Timestamp happened_at = 2;
  EVENT_KIND kind = 3;
  // Exactly one event-specific payload is set per event.
  oneof payload {
    GitOpsSyncError gitops_sync_error = 100;
    ConfigurationError configuration_error = 101;
  }
}
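To illustrate how a consumer such as the UI might work with the `oneof payload`, here is a plain-Python mirror of the tentative messages. It is a sketch only: it does not use generated protobuf code, the `describe` helper and its wording are hypothetical, and the event kinds are modelled with string values rather than the proto field numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Union


class EventKind(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"
    SUCCESS = "success"


@dataclass(frozen=True)
class ConfigurationError:
    message: str


@dataclass(frozen=True)
class GitOpsSyncError:
    project_id: int
    message: str


@dataclass(frozen=True)
class Event:
    agent_id: int
    happened_at: datetime
    kind: EventKind
    # Mirrors the proto oneof: exactly one payload type per event.
    payload: Union[GitOpsSyncError, ConfigurationError]


def describe(event):
    """Render an event as one line of text for an activity list."""
    p = event.payload
    if isinstance(p, GitOpsSyncError):
        return (f"Error: Failed to synchronize manifests in project "
                f"{p.project_id}. Message: {p.message}")
    if isinstance(p, ConfigurationError):
        return f"Error: Invalid configuration. Message: {p.message}"
    raise TypeError(f"unsupported payload: {type(p).__name__}")


event = Event(
    agent_id=42,
    happened_at=datetime(2021, 3, 15, 12, 0, tzinfo=timezone.utc),
    kind=EventKind.ERROR,
    payload=GitOpsSyncError(project_id=7, message="API server timeout"),
)
line = describe(event)
```

Adding a new event type then means adding one payload message and one branch in the renderer, which is the main appeal of the `oneof` layout.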
Further details
We need to build:
- a way to pass this information from its source (agentk, kas, or maybe the rails app in some cases) to some place for storage.
- a UI for the user to be able to see it and get notified of new important events.
- Hooks on events #276248
User needs to be able to:
- View events list in historical order, most recent on top (see event types above) from Agent details page
- View details of event
- See the connection status of Agent
- Get notified of new important events (what's an "important" event?)
Permissions and Security
All users should be able to see this page.
Links / references
- Design: Cluster details page for a cluster managed by an agent
- Initial explorations