Beta readiness review: ClickHouse Cloud for GitLab.com

Production Readiness

This issue serves as a tracking issue to guide you through the readiness review. It's not the production readiness document itself! The readiness documentation will be added to the project with a merge request, where stakeholders from different teams can collaborate.

Readiness MR

Beta readiness review: ClickHouse Cloud for Git... (!201 - merged) • Kennedy Wanyangu, Nate Rosandich • 16.10

Reviewers

The reviewers will be filled in as one of the steps of the checklist below. If a reviewer in the "Mandatory" section is not allocated, please add the reason why next to the name.

Mandatory

Reliability:
- SRE: @rehab
- DBRE: @alexander-sosna
Delivery: @rpereira2
InfraSec: @ugovindia (https://gitlab.com/gitlab-com/gl-security/security-operations/infrastructure-security/bau/-/issues/3135)

Optional

Delete these reviewers if they do not apply

Development: reviewer name. Consider a review from the team members who were closely involved in the development of this work, to ensure that the details match their mental model.
Scalability: reviewer name. If there are concerns about how this will operate at scale, the Scalability group can help assess them.
Database: reviewer name. If there are complex migrations or queries, the Database group can determine whether these are safe to run.
Application Security: reviewer name. If there are concerns about application security, the group's Application Security stable counterpart can help.

Readiness Checklist

The following items should be completed by the person initiating the readiness review:

Review the Production Readiness Review handbook page.
Create this issue and assign it to yourself.
- Set a due-date for when you believe the readiness will be completed (this can be updated later if necessary).
- Add a ~"workflow-infra::proposal" label to the issue while the mandatory reviewers listed in the issue template are assigned.
In the "Reviewers" section above, add the reviewer names. Reviewers are assigned by the engineering manager of the corresponding team. To request one, @ mention the team members of the following leadership groups:
- Reliability: Reach out to the Reliability management team
- Delivery: Reach out to the Delivery management team
- InfraSec: Create an issue in this team's tracker. More information is available on the Infrastructure Security Team's handbook page. After the issue is created, put a link to the issue next to the InfraSec reviewer item above and add the reviewer name after one has been assigned.
Create the first draft of the readiness review by copying the template below and submitting an MR. Do not remove any items or sections in the template. It is only required to fill in the items up to and including the corresponding maturity level. For example, for ~"Readiness::Beta", all sections under Beta and Experiment will need to be completed.
Assign the initial set of reviewers to the MR. Once the MR has been assigned, add the ~"workflow-infra::In Progress" label to this issue.
Add a link to the MR in the "Readiness MR" section at the top of this issue
Once the MR has been sent out for review, add a ~"Readiness::*" scoped label for the corresponding target maturity level for the review.
When the last review of the MR is complete and the MR is merged, do one of the following:
1. If the feature will remain at the current maturity level for an uncertain amount of time, close the issue and add a ~"workflow-infra::done" label to the issue.
2. If the feature will need to be reviewed for the next maturity level soon, add the corresponding ~"Readiness::*" scoped label and repeat the process using the same issue.
(Optional) If it is later decided not to proceed with this proposal, add the ~"workflow-infra::Cancelled" label and close this issue.

Readiness MR Template

Expand the section below to view the readiness template. This will be the starting point for the readiness merge request.

Create <name>/index.md as a new merge request with the following content, where <name> is something short and descriptive for the change being proposed.

The Readiness Review document is designed to help you prepare your features and services for the GitLab Production Platforms. Please engage with the relevant teams as soon as possible to begin review even if there are incomplete items below. All sections should be completed up to the current maturity level. For example, if the target maturity is "Beta", then items under "Experiment" and "Beta" should be completed.

While you are encouraged to fill out as much of this document as possible, not all of the items below will be relevant. Leave all non-applicable items intact and add 'N/A', or the reason why the item does not apply, in place of the response. This Guide is just that, a Guide. If something is not asked, but should be, it is strongly encouraged to add it as necessary.

Experiment

Service Catalog

The items below will be reviewed by the Reliability team.

Link to the service catalog entry for the service. Ensure that the following items are present in the service catalog, or listed here:
- Link to or provide a high-level summary of this new product feature.
- Link to the Architecture Design Workflow for this feature. If no design was completed for this feature, please explain why.
- List the feature group that created this feature/service and its current Engineering Managers, Product Managers and their Directors.
- List the individuals who are the subject matter experts and know the most about this feature.
- List the team or set of individuals that will take responsibility for the reliability of the feature once it is in production.
- List the member(s) of the team who built the feature and will be on-call for the launch.
- List the external and internal dependencies of the application (e.g., Redis, Postgres) for this feature and how the service will be impacted by a failure of each dependency.

Infrastructure

The items below will be reviewed by the Reliability team.

Do we use IaC (e.g., Terraform) for all the infrastructure related to this feature? If not, what kind of resources are not covered?
Is the service covered by any DDoS protection solution (GCP/AWS load-balancers or Cloudflare usually cover this)?
Are all cloud infrastructure resources labeled according to the Infrastructure Labels and Tags guidelines?

Operational Risk

The items below will be reviewed by the Reliability team.

List the top three operational risks when this feature goes live.
For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk?

Monitoring and Alerting

The items below will be reviewed by the Reliability team.

Link to the metrics catalog for the service

Deployment

The items below will be reviewed by the Delivery team.

Will a change management issue be used for rollout? If so, link to it here.
Can the new product feature be safely rolled back once it is live? Can it be disabled using a feature flag?
How are the artifacts being built for this feature (e.g., using the CNG or another image building pipeline)?

Security Considerations

The items below will be reviewed by the InfraSec team.

Link or list information for new resources of the following type:
- AWS Accounts/GCP Projects:
- New Subnets:
- VPC/Network Peering:
- DNS names:
- Entry-points exposed to the internet (Public IPs, Load-Balancers, Buckets, etc...):
- Other (anything relevant that might be worth mentioning):
Were the GitLab security development guidelines followed for this feature?
Was an Application Security Review requested, if appropriate? Link it here.
Do we have an automatic procedure to update the infrastructure (OS, container images, packages, etc.)? For example, unattended upgrades or Renovate bot to keep dependencies up to date.
For IaC (e.g., Terraform), are any secure static code analysis tools in use (such as kics or checkov)? If not, and new IaC is being introduced, please explain why.
If we're creating new containers (e.g., a Dockerfile with an image build pipeline), are we using kics or checkov to scan Dockerfiles or GitLab's container scanner for vulnerabilities?

Identity and Access Management

The items below will be reviewed by the InfraSec team.

Are we adding any new forms of Authentication (New service-accounts, users/password for storage, OIDC, etc...)?
Was effort put in to ensure that the new service follows the least privilege principle, so that permissions are reduced as much as possible?
Do firewalls follow the least privilege principle (w/ network policies in Kubernetes or firewalls on cloud provider)?
Is the service covered by a WAF (Web Application Firewall) in Cloudflare?

Logging, Audit and Data Access

The items below will be reviewed by the InfraSec team.

Did we make an effort to redact customer data from logs?
What kind of data is stored on each system (secrets, customer data, audit, etc...)?
How is data rated according to our data classification standard (customer data is RED)?
Do we have audit logs for when data is accessed? If you are unsure or if using Reliability's central logging and a new pubsub topic was created, create an issue in the Security Logging Project using the add-remove-change-log-source template.
Ensure appropriate logs are being kept for compliance and requirements for retention are met.
If the data classification for the new environment is RED, please create a Security Compliance Intake issue. Note that this is not necessary if the service is deployed in existing Production infrastructure.

Beta

Monitoring and Alerting

The items below will be reviewed by the Reliability team.

Link to examples of logs on https://logs.gitlab.net
Link to the Grafana dashboard for this service.

Backup, Restore, DR and Retention

The items below will be reviewed by the Reliability team.

Are there custom backup/restore requirements?
Are backups monitored?
Was a restore from backup tested?
Link to information about growth rate of stored data.

Deployment

The items below will be reviewed by the Delivery team.

Will a change management issue be used for rollout? If so, link to it here.
Does this feature have any version compatibility requirements with other components (e.g., Gitaly, Sidekiq, Rails) that will require a specific order of deployments?
Is this feature validated by our QA blackbox tests?
Will it be possible to roll back this feature? If so, explain how.

Security

The items below will be reviewed by the InfraSec team.

Put yourself in an attacker's shoes and list some examples of "What could possibly go wrong?". Are you OK going into Beta knowing that?
Link to any outstanding security-related epics & issues for this feature. Are you OK going into Beta with those still on the TODO list?

General Availability

Monitoring and Alerting

The items below will be reviewed by the Reliability team.

Link to the troubleshooting runbooks.
Link to an example of an alert and a corresponding runbook.
Confirm that on-call Reliability SREs have access to this service and will be on-call. If this is not the case, please add an explanation here.

Operational Risk

The items below will be reviewed by the Reliability team.

Link to notes or testing results for assessing the outcome of failures of individual components.
What are the potential scalability or performance issues that may result from this change?
What are a few operational concerns that will not be present at launch, but may be a concern later?
Are there any single points of failure in the design? If so, list them here.
As a thought experiment, think of worst-case failure scenarios for this product feature. How can the blast radius of the failure be isolated?

Backup, Restore, DR and Retention

The items below will be reviewed by the Reliability team.

Are there any special requirements for Disaster Recovery for both Regional and Zone failures beyond our current Disaster Recovery processes that are in place?
How does data age? Can data over a certain age be deleted?

Performance, Scalability and Capacity Planning

The items below will be reviewed by the Reliability team.

Link to any performance validation that was done according to performance guidelines.
Link to any load testing plans and results.
Are there any potential performance impacts on the Postgres database or Redis when this feature is enabled at GitLab.com scale?
Explain how this feature uses our rate limiting features.
Are there retry and back-off strategies for external dependencies?
Does the feature account for brief spikes in traffic, at least 2x above the expected rate?

Deployment

The items below will be reviewed by the Delivery team.

Will a change management issue be used for rollout? If so, link to it here.
Are there healthchecks or SLIs that can be relied on for deployment/rollbacks?
Does building artifacts or deployment depend at all on gitlab.com?

Edited Mar 01, 2024 by Uday Govindia