[Spike] Evaluate Charts CI review app workflow
Summary
While testing #1307 (closed), some thoughts came to mind on our Review App workflow in CI.
Thoughts on our Review App workflow in CI
- We have many Charts pipeline failures, and a strong majority of them are fixed with a simple retry. They are often transient failures caused by networking problems, cluster resource exhaustion, or conflicts between multiple pipelines acting on the same environment.
- We (understandably) only test a mostly default installation in CI. This means that, quite often, the changes made in a merge request are not actually reflected in the releases we deploy to the clusters, which makes many of our QA tests a poor indicator of whether a change is valid. In practice, failures are most often either retried until they pass or ignored entirely because the change set obviously has no impact on the results of these tests.
- We struggle to keep up with support for new Kubernetes releases because of the effort required to create new clusters, provision them with dependencies like Cert Manager, install Kubernetes Agents, and update our CI configuration to reference them.
- We currently deploy to one version of EKS and two versions of GKE. Once `helm install` returns successfully, we are fairly confident that we have full compatibility with that version of Kubernetes, because no deprecated APIs that have been removed were referenced and no other issues with our manifest configurations are present. So for the most part, a successful `helm install` goes a long way toward validating support for a given version of Kubernetes (see the sketch after this list). After that, it's up to the application logic running inside those workloads to function properly, which doesn't have much bearing on the underlying version of Kubernetes.
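To make this concrete, here is a minimal sketch of what such a smoke-test job could look like in GitLab CI, assuming a runner whose kubeconfig points at one of the test clusters. The job name, namespace scheme, and timeout are illustrative assumptions, not our actual configuration.

```yaml
# Hypothetical smoke-test job: a successful `helm install` with (mostly)
# default values confirms the rendered manifests reference no removed APIs
# and are accepted by the target Kubernetes version.
smoke_install:
  stage: test
  script:
    # Install the chart from the repository root into a throwaway namespace.
    - helm install gitlab-smoke . --namespace "smoke-${CI_PIPELINE_ID}" --create-namespace --wait --timeout 10m
  after_script:
    # Clean up regardless of the job result.
    - helm uninstall gitlab-smoke --namespace "smoke-${CI_PIPELINE_ID}" || true
    - kubectl delete namespace "smoke-${CI_PIPELINE_ID}" --ignore-not-found
```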
Considerations for improving our workflow
Given the thoughts above, here are some considerations worth discussing for our Review App workflow:
- We could run ephemeral Kubernetes environments with a tool like `vcluster` to confirm support for new Kubernetes versions with very little effort. For an example, see gitlab-org/charts/gitlab!3378 (17a551e3), and the sketch after this list.
- We could run QA tests against only one live environment in a cloud provider, either EKS or GKE (or both if we really want to). By cutting down on the number of Review Apps we run QA against, we greatly reduce the opportunities for transient failures and can save significant time during DRI duties debugging environments.
- Building on the previous point, we could even consider only running QA tests on `master` and `stable` branches (see the second sketch after this list). Many of the day-to-day changes to the Charts project affect Kubernetes manifest generation, which is well tested with our `rspec` configuration as well as smoke tests with `helm install`. Running QA against (currently) three different clusters ends up being very expensive and a bit redundant.
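As a concrete illustration of the first point, a job that provisions an ephemeral cluster with the `vcluster` CLI could look roughly like the sketch below. This assumes a runner with kubeconfig access to a host cluster; the job name and the `vcluster-values.yaml` file (used to pin the virtual cluster's Kubernetes version) are hypothetical, and the linked merge request shows the real approach.

```yaml
# Hypothetical ephemeral-cluster job using the vcluster CLI.
vcluster_install:
  stage: test
  script:
    # Create a virtual cluster inside the host cluster without connecting yet;
    # the Kubernetes version is assumed to be pinned via the values file.
    - vcluster create "ci-${CI_PIPELINE_ID}" --connect=false -f vcluster-values.yaml
    # Run the smoke install with the kubeconfig pointed at the virtual cluster.
    - vcluster connect "ci-${CI_PIPELINE_ID}" -- helm install gitlab . --wait --timeout 10m
  after_script:
    # Deleting the virtual cluster also removes everything installed in it.
    - vcluster delete "ci-${CI_PIPELINE_ID}"
```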
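And for the third point, limiting QA to `master` and stable branches would likely be a small `rules:` change in the CI configuration. A minimal sketch, assuming a hypothetical job name and that stable branches follow the usual `X-Y-stable` naming convention:

```yaml
# Hypothetical QA job restricted to master and X-Y-stable branches.
qa:
  stage: qa
  rules:
    - if: '$CI_COMMIT_BRANCH == "master"'
    - if: '$CI_COMMIT_BRANCH =~ /^[0-9]+-[0-9]+-stable$/'
  script:
    # Placeholder for the actual QA suite invocation.
    - echo "Run QA against the live review environment"
```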
Overall, I wanted to open this issue to capture some thoughts that came up while tinkering with `vcluster`. Some of these thoughts are still young, so I'm very open to other opinions, historical context, and alternative solutions, as always. The main goals here are to:
- Ensure our pipelines run reasonable tests at reasonable times.
- Minimize the pipeline runtime as much as possible.
- Minimize cloud infrastructure costs as much as possible.
- Reduce the amount of time Distribution engineers debug transient errors and retry pipelines, and free up time for more meaningful work.