API support for Lead time for MRs to be deployed to production
Release notes
The user should be able to use the API to present the lead time data in an external tool.
This feature is available in GitLab Ultimate.
Problem to solve
As an executive investing in DevOps, I want to see my ROI. I want to see my dev team improve the time it takes for MRs to be deployed to production. This will most likely drive the KPIs of my team.
As a development leader, I want to see that the team is improving by measuring the lead time for MRs to be deployed to production and comparing it over time (over sprints).
I want to import the data from GitLab into an external monitoring tool, so I need API support to poll the data.
Intended users
User experience goal
The user should be able to use the API to present the lead time data in an external tool.
Proposal
We should show the median time for merged MRs to reach production:
- REST API and backend service to perform calculations
- REST API documentation
Note: GraphQL equivalent will be handled separately in #323966 (closed)
We need to support a `GET` API for this. As parameters, there should be a configurable timestamp; the default time window, if not set, should be 30 days.
For this iteration the API should return the median time for merged MRs at the project scope, but in the future we will add group-level and instance-level support, so this should be taken into consideration when developing this API.
Also, in the future we would want to add support for specifying the environment that the merge was deployed to. In the true sense of DORA 4, this is only meant for production, but there may be more than one production environment, and/or users may be interested in non-production environments.
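For illustration, an external tool could poll such an endpoint as in the sketch below. The endpoint path (`/projects/:id/analytics/merge_requests_lead_time`), the `from` parameter name, and the response shape are hypothetical, not the final API:

```ruby
require 'net/http'
require 'json'
require 'date'

# Hypothetical endpoint path and parameter name -- the final API may differ.
uri = URI('https://gitlab.example.com/api/v4/projects/42/analytics/merge_requests_lead_time')
uri.query = URI.encode_www_form(from: (Date.today - 30).iso8601) # default window: 30 days

request = Net::HTTP::Get.new(uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_TOKEN')

response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
  http.request(request)
end

# Assumed response shape, e.g. { "median_lead_time_seconds": 86400 }
puts JSON.parse(response.body)
```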
Further details
There is another consideration: some additional work to create a data structure (tables) to store the metrics, and I think this should be designed so it is flexible enough to:
- commonly capture the data for all metrics
- support a view at Project or any level of Group
- support a minimal time slice which could then be aggregated
This is one metric out of four, so there needs to be a common API structure to meet all needs.
Technical Proposal
This technical proposal is to accomplish the DORA4 metrics-related issues scheduled in 13.10 (namely Deployment analytics and Lead Time for Changes). It provides a mid-term solution that allows any users/projects to use the DORA4 metrics feature at scale, regardless of project or group level.
This also partially resolves "First-class concept for production environments".
Current problems
- In a mid-sized project (or above), the database query takes about 10 seconds to complete (with a cold cache), which violates the query performance guidelines.
- The data aggregation happens in Ruby. This adds an additional 40-50 seconds; Ruby is not performant at compute-intensive processing.
- The data aggregation happens for each API request, even though the same data is presented regardless of the user's role.
- Given that web nodes (Puma) on production time out in 60 seconds, some API requests often result in a timeout error.
Approach
We should persist pre-aggregated data into a database table and present it directly (for the frontend and external systems). This is a similar approach to the daily coverage metrics feature.
Database table design
'environments' table - Existing table
id integer NOT NULL,
project_id integer NOT NULL,
name character varying NOT NULL,
created_at timestamp without time zone,
updated_at timestamp without time zone,
external_url character varying,
environment_type character varying,
state character varying DEFAULT 'available'::character varying NOT NULL,
slug character varying NOT NULL,
auto_stop_at timestamp with time zone,
auto_delete_at timestamp with time zone
tier (enum) - New column. One of 'production', 'staging', 'review-app', etc.
- Sadly, `environment_type` is already used for the foldering feature, so we name the new column `tier`.
'cd_daily_deployments_metrics' table - New table
group_id (Foreign key, cascading deletion, not null)
project_id (Foreign key, cascading deletion, not null)
environment_tier (not null)
date (date, not null) ... e.g. 2021-01-01
deployments_count (integer)
total_lead_time_for_changes (integer - unit: second)
Unique index on (project_id, environment_tier, date)
Unique index on (group_id, environment_tier, date)
Unique index on (group_id, project_id)
'deployment_merge_requests' table - Existing table
deployment_id integer NOT NULL,
merge_request_id integer NOT NULL,
environment_id integer,
lead_time_for_changes integer - New column. How long a merged MR took to be deployed
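As a sketch, the schema changes above could look like the following Rails-style migration. GitLab's actual migrations use its own helpers and are split across releases, so treat the helper calls and the integer-backed `tier` encoding as assumptions:

```ruby
# frozen_string_literal: true

# Sketch only: GitLab's real migrations use its own helpers and
# conventions; the enum encoding for `tier` is assumed.
class AddDora4DailyMetrics < ActiveRecord::Migration[6.0]
  def change
    # New column on the existing environments table.
    # e.g. 0 = production, 1 = staging, 2 = review-app
    add_column :environments, :tier, :integer, limit: 2

    create_table :cd_daily_deployments_metrics do |t|
      # Groups are namespaces in GitLab, hence to_table: :namespaces.
      t.references :group, null: false, index: false,
        foreign_key: { to_table: :namespaces, on_delete: :cascade }
      t.references :project, null: false, index: false,
        foreign_key: { on_delete: :cascade }
      t.integer :environment_tier, null: false
      t.date :date, null: false
      t.integer :deployments_count
      t.integer :total_lead_time_for_changes # unit: seconds

      t.index [:project_id, :environment_tier, :date], unique: true
      t.index [:group_id, :environment_tier, :date], unique: true
      t.index [:group_id, :project_id], unique: true
    end

    # New column on the existing join table.
    add_column :deployment_merge_requests, :lead_time_for_changes, :integer
  end
end
```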
Pre-aggregation in the deployment life cycle
- A deployment succeeds.
- `Deployments::UpdateEnvironmentWorker` runs.
  - `environments.tier` is automatically set if it's null.
- `Deployments::LinkMergeRequestsService` runs.
  - `deployment_merge_requests.lead_time_for_changes` is set based on `deployment.deployed_at - merge_request.merged_at`.
- Schedule `Deployments::UpdateDailyDeploymentsMetricsWorker` in 10 minutes. (This is to ensure that the above async tasks have already finished.)
  - Upsert a row in `cd_daily_deployments_metrics`. It basically counts up the `deployments_count` and `total_lead_time_for_changes` (see the sketch below).
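A minimal sketch of the upsert step follows. The model names (`DailyDeploymentsMetric`, `DeploymentMergeRequest`) and the `deployed_at`/`tier` accessors are assumptions, and production code would want an atomic `INSERT ... ON CONFLICT DO UPDATE` rather than this read-modify-write:

```ruby
# Sketch only: model names and attribute readers are assumptions,
# not GitLab's actual code.
module Deployments
  class UpdateDailyDeploymentsMetricsWorker
    include Sidekiq::Worker

    def perform(deployment_id)
      deployment = Deployment.find(deployment_id)

      # Sum of per-MR lead times linked to this deployment (seconds),
      # written earlier by Deployments::LinkMergeRequestsService.
      lead_time = DeploymentMergeRequest
        .where(deployment_id: deployment.id)
        .sum(:lead_time_for_changes)

      metric = DailyDeploymentsMetric.find_or_initialize_by(
        project_id: deployment.project_id,
        environment_tier: deployment.environment.tier,
        date: deployment.deployed_at.to_date
      )
      metric.group_id ||= deployment.project.group&.id
      metric.deployments_count = metric.deployments_count.to_i + 1
      metric.total_lead_time_for_changes =
        metric.total_lead_time_for_changes.to_i + lead_time
      metric.save! # non-atomic; a real worker would upsert with ON CONFLICT
    end
  end
end
```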
Database queries
Project-level:
Deployment frequencies
SELECT deployments_count FROM cd_daily_deployments_metrics
WHERE project_id = 1 AND
environment_tier = 1 AND
date > '2021-01-01' AND
date < '2021-02-23'
ORDER BY date ASC
Lead time for changes
SELECT total_lead_time_for_changes FROM cd_daily_deployments_metrics
WHERE project_id = 1 AND
environment_tier = 1 AND
date > '2021-01-01' AND
date < '2021-02-23'
ORDER BY date ASC
Group-level:
Deployment frequencies
SELECT date, SUM(deployments_count) FROM cd_daily_deployments_metrics
WHERE group_id = 1 AND
environment_tier = 1 AND
date > '2021-01-01' AND
date < '2021-02-23'
GROUP BY date
ORDER BY date ASC
Lead time for changes
SELECT date, SUM(total_lead_time_for_changes) FROM cd_daily_deployments_metrics
WHERE group_id = 1 AND
environment_tier = 1 AND
date > '2021-01-01' AND
date < '2021-02-23'
GROUP BY date
ORDER BY date ASC
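Note that the table stores daily totals, so the value derived from these queries is a per-day average (`total_lead_time_for_changes / deployments_count`); a true median would require per-MR data. A minimal sketch of the derivation, run inside the Rails app and assuming the group-level query above:

```ruby
# Sketch only: derives a per-day average lead time from the
# group-level query above. With only daily totals stored, this is
# a mean, not a true median.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT date,
         SUM(deployments_count) AS deployments_count,
         SUM(total_lead_time_for_changes) AS total_lead_time
  FROM cd_daily_deployments_metrics
  WHERE group_id = 1 AND
        environment_tier = 1 AND
        date > '2021-01-01' AND
        date < '2021-02-23'
  GROUP BY date
  ORDER BY date ASC
SQL

rows.each do |row|
  count = row['deployments_count'].to_i
  next if count.zero?

  puts "#{row['date']}: #{row['total_lead_time'].to_i / count} seconds"
end
```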
Permissions
- At project-level, Developer role (or above) can access the data.
- At group-level, Developer role (or above) can access the data. This is to ensure that the actor definitely has read permission to all of the group's projects.
Estimated weights
TBA (Should be achievable in 3 weeks)