Rollout serving migrated data (feature flag `pages_serve_from_migrated_zip`)
What
Roll out the `:pages_serve_from_migrated_zip` feature flag, which makes us serve data from https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3252.
Introduced by: !52573 (merged)
Owners
- Team: GitLab Pages
- Most appropriate Slack channel to reach out to: `#gitlab_pages`
- Best individual to reach out to: @vshushlin @ayufan
Expectations
What are we expecting to happen?
Migrated projects will be served from the ZIP artifact in Object Storage instead of from the VFS disk.
What might happen if this goes wrong?
Data might not be served properly.
What can we monitor to detect problems with this?
Similar to: gitlab-com/gl-infra/production#2808 (comment 430825927)
- Monitor the percentage of pages using VFS `zip`: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.stacked=1&g0.max_source_resolution=0s&g0.expr=sum(rate(gitlab_pages_vfs_operations_total%5B5m%5D))%20by%20(vfs_name)&g0.tab=0 => The amount of files served with `zip` should increase
- Monitor TTFB for Object Storage: https://thanos-query.ops.gitlab.net/graph?g0.range_input=12h&g0.max_source_resolution=0s&g0.expr=avg(rate(gitlab_pages_httprange_trace_duration_sum%7Brequest_stage%3D%22httptrace.ClientTrace.GotFirstResponseByte%22%7D%5B5m%5D)%2Frate(gitlab_pages_httprange_trace_duration_count%7Brequest_stage%3D%22httptrace.ClientTrace.GotFirstResponseByte%22%7D%5B5m%5D))&g0.tab=0
- Monitor average latency: ZIP vs NFS: https://log.gprd.gitlab.net/goto/c6ef577321c0cba3c77086aef974202f
- Unique domains: https://log.gprd.gitlab.net/goto/80a0b0682e4987c0180bac8d39b80143
- Caching of ZIP archives: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.moment_input=2021-02-22%2015%3A15%3A02&g0.max_source_resolution=0s&g0.expr=avg(gitlab_pages_zip_cached_entries%7Bop%3D%22archive%22%7D)&g0.tab=0
- Amount of cached ZIP entries: https://prometheus-app.gprd.gitlab.net/graph?g0.expr=avg(gitlab_pages_zip_archive_entries_cached)&g0.tab=0&g0.stacked=0&g0.range_input=1h
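The monitoring links above carry their PromQL expressions URL-encoded in the `g0.expr` query parameter, which makes them hard to read at a glance. As a minimal sketch (the helper name `promql_from_graph_url` is made up for illustration), the expression can be decoded with the Python standard library:

```python
from urllib.parse import parse_qs, urlsplit


def promql_from_graph_url(url: str, panel: str = "g0") -> str:
    """Return the decoded PromQL expression of one panel in a graph URL."""
    query = parse_qs(urlsplit(url).query)
    return query[f"{panel}.expr"][0]


# The "percentage of pages using VFS zip" link from the list above.
VFS_URL = (
    "https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.stacked=1"
    "&g0.max_source_resolution=0s"
    "&g0.expr=sum(rate(gitlab_pages_vfs_operations_total%5B5m%5D))%20by%20(vfs_name)"
    "&g0.tab=0"
)

print(promql_from_graph_url(VFS_URL))
# sum(rate(gitlab_pages_vfs_operations_total[5m])) by (vfs_name)
```

Read this way, the first graph is the per-`vfs_name` rate of VFS operations over 5 minutes, so the `zip` series should grow relative to the others as the rollout proceeds.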
We do a percentage rollout:
# 5% of projects
/chatops run feature set pages_serve_from_migrated_zip 5 --actors
# 10% of projects
/chatops run feature set pages_serve_from_migrated_zip 10 --actors
# 25% of projects
/chatops run feature set pages_serve_from_migrated_zip 25 --actors
# 50% of projects
/chatops run feature set pages_serve_from_migrated_zip 50 --actors
# 100% of projects
/chatops run feature set pages_serve_from_migrated_zip 100 --actors
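To illustrate why stepping the percentage up with `--actors` is safe, here is a rough sketch of deterministic per-actor gating. It is an assumption-laden illustration of the general idea, not GitLab's or its feature-flag library's actual implementation: the function `enabled_for`, the CRC32 bucketing, and the `Project:<id>` actor ids are all hypothetical.

```python
import zlib


def enabled_for(flag: str, actor_id: str, percentage: int) -> bool:
    """Hypothetical gate: hash flag name plus actor id into a bucket 0..99."""
    bucket = zlib.crc32(f"{flag}{actor_id}".encode()) % 100
    return bucket < percentage


FLAG = "pages_serve_from_migrated_zip"
projects = [f"Project:{i}" for i in range(1, 1001)]  # made-up actor ids

previous = set()
for pct in (5, 10, 25, 50, 100):
    current = {p for p in projects if enabled_for(FLAG, p, pct)}
    # Raising the percentage only ever adds projects, never removes them.
    assert previous <= current
    print(f"{pct:>3}% -> {len(current)} of {len(projects)} projects")
    previous = current
```

Because each project always lands in the same bucket, a project that starts being served from ZIP at 5% keeps being served from ZIP at every later step, which keeps the monitoring graphs comparable between steps.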
Roll Out Steps
- Enable on staging (`/chatops run feature set pages_serve_from_migrated_zip true --staging`)
- Test on staging
- Ensure that documentation has been updated
- [-] Continue performing percentage rollout of actors
- Enable on production for specific project (`/chatops run feature set --project=ayufan/pages-jekyll pages_serve_from_migrated_zip true`)
- Coordinate a time to enable the flag with the SRE oncall and release managers
  - In `#production`, mention `@sre-oncall` and `@release-managers`. Once an SRE on call and Release Manager on call confirm, you can proceed with the rollout
- 5% rollout `/chatops run feature set pages_serve_from_migrated_zip 5 --actors`
- 10% rollout `/chatops run feature set pages_serve_from_migrated_zip 10 --actors`
- 25% rollout `/chatops run feature set pages_serve_from_migrated_zip 25 --actors`
- 50% rollout `/chatops run feature set pages_serve_from_migrated_zip 50 --actors`
- Enable a 100% rollout on GitLab.com by running chatops command in `#production` (`/chatops run feature set pages_serve_from_migrated_zip true`)
- Cross post chatops Slack command to `#support_gitlab-com` (more guidance when this is necessary in the dev docs) and in your team channel
- Announce on the issue that the flag has been enabled
- Remove feature flag and add changelog entry
- After the flag removal is deployed, clean up the feature flag by running chatops command in `#production` channel
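The production part of the checklist follows a fixed order: one project first, then increasing percentages of actors, then fully on. As a small sketch, the command sequence could be generated rather than typed by hand; the helper `rollout_commands` is hypothetical, and it only builds strings, it does not talk to Slack or chatops.

```python
from typing import Iterable, List


def rollout_commands(flag: str, project: str,
                     percentages: Iterable[int] = (5, 10, 25, 50)) -> List[str]:
    """Build the ordered production chatops commands for a staged rollout."""
    base = "/chatops run feature set"
    commands = [f"{base} --project={project} {flag} true"]  # single project first
    commands += [f"{base} {flag} {pct} --actors" for pct in percentages]
    commands.append(f"{base} {flag} true")  # final step: enabled for everyone
    return commands


for command in rollout_commands("pages_serve_from_migrated_zip",
                                "ayufan/pages-jekyll"):
    print(command)
```

Each command would still be pasted into `#production` one at a time, with the monitoring graphs checked between steps.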
Rollback Steps
- This feature can be disabled by running the following Chatops command:
/chatops run feature delete pages_serve_from_migrated_zip
Edited by Vladimir Shushlin