Use default and max workspace resources on workspace reconcile
## What does this MR do and why?
Issue: Backend: Add logic for using the agent's defaul... (#427144 - closed)
Use default and max workspace resources on workspace reconcile.
With Workspace config_version 2 migration (!131402 - merged), we no longer need `desired_config_generator_prev1` and `devfile_parser_prev1`, since all non-terminated workspaces have been migrated to config version 2. However, this change needs to introduce a new config version. So instead of removing the files in one MR and then reintroducing them in this MR, I've updated the files directly in this MR.
This MR is broken down into 4 commits for easier review.
- Update previous versions of workspace resources generation
  - Update `desired_config_generator_prev1` with contents of `desired_config_generator`
  - Update `devfile_parser_prev1` with contents of `devfile_parser`
  - Update remote development shared contexts
- Apply container resource defaults and create resource quota (see the sketch after this list)
  - Generate a resource quota using the agent's `max_resources_per_workspace`.
  - Add an annotation for the SHA256 of `max_resources_per_workspace` to force a workspace restart when the value changes.
  - Deep merge the `default_resources_per_workspace_container` into the containers and init containers of the workspace to apply the defaults.
- Update version of newly created workspaces
- Update naming convention for config versions (Related: Revisit versioning of #create_config_to_apply i... (#425227 - closed))
  - Rename all `_prev1` files to `_v2` to make the workspace config version explicit in these files
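Roughly, the quota generation and restart-forcing annotation work like the following sketch. This is a simplified illustration, not the actual GitLab implementation: the quota name, annotation key, and hash input format are assumptions.

```ruby
require 'digest'

# Agent's max_resources_per_workspace (example values).
max_resources = {
  limits:   { cpu: "5", memory: "5Gi" },
  requests: { cpu: "3", memory: "3Gi" }
}

# 1. Build a Kubernetes ResourceQuota for the workspace namespace so that
#    Kubernetes rejects any pod exceeding the agent-wide maximums.
resource_quota = {
  "apiVersion" => "v1",
  "kind"       => "ResourceQuota",
  "metadata"   => { "name" => "workspace-resource-quota" }, # name is hypothetical
  "spec"       => {
    "hard" => {
      "limits.cpu"      => max_resources[:limits][:cpu],
      "limits.memory"   => max_resources[:limits][:memory],
      "requests.cpu"    => max_resources[:requests][:cpu],
      "requests.memory" => max_resources[:requests][:memory]
    }
  }
}

# 2. Annotate the workspace pod template with a SHA256 of the value: any
#    change to max_resources_per_workspace changes the annotation, which
#    changes the pod template and therefore forces the pod to be recreated.
annotations = {
  "workspaces.gitlab.com/max-resources-per-workspace-sha256" => # key assumed
    Digest::SHA256.hexdigest(max_resources.to_s)
}
```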
This MR will be followed by a migration in Rails: Migrate workspaces with config_version=2... (#434494 - closed) to migrate non-terminated workspaces from config version 2 to 3.
## How to set up and validate locally
### Setup

- Set the remote development agent config where `max_resources_per_workspace` and `default_resources_per_workspace_container` are not set:

  ```yaml
  remote_development:
    enabled: true
    dns_zone: workspaces.localdev.me
    network_policy:
      enabled: true
      egress:
        - allow: 0.0.0.0/0
          except:
            - 10.0.0.0/8
            - 172.16.0.0/12
            - 192.168.0.0/16
        - allow: 172.16.123.1/32
  ```

- Create a devfile in a project (the `cpuLimit` of `2.3` matters for the quota checks later; see the sketch after this list):

  ```yaml
  schemaVersion: 2.2.0
  components:
    - name: gitlab-ui
      attributes:
        gl/inject-editor: true
      container:
        image: registry.gitlab.com/gitlab-org/remote-development/gitlab-remote-development-docs/debian-bullseye-ruby-3.2.patched-golang-1.20-rust-1.65-node-18.16-postgresql-15@sha256:216b9bf0555349f4225cd16ea37d7a627f2dad24b7e85aa68f4d364319832754
        env:
          - name: STORYBOOK_HOST
            value: "0.0.0.0"
        endpoints:
          - name: storybook
            targetPort: 9001
            secure: true
            protocol: http
        memoryLimit: "2048Mi"
        cpuLimit: "2.3"
  ```

- Create a new workspace for this project. Open this workspace, create a new file in it called `TEST.md`, and type something. This will be used later for some of the validations.
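Per the devfile 2.x schema, a container's `cpuLimit` and `memoryLimit` translate into the Kubernetes container resource limits that the later quota checks operate on. A minimal sketch of that mapping, for illustration only:

```ruby
# Devfile container attributes (from the devfile above).
devfile_container = { "cpuLimit" => "2.3", "memoryLimit" => "2048Mi" }

# The Kubernetes container resource limits these translate into, which the
# validation steps below inspect via `kubectl describe po`.
resources = {
  limits: {
    cpu:    devfile_container["cpuLimit"],
    memory: devfile_container["memoryLimit"]
  }
}
```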
### Verifying `default_resources_per_workspace_container` behaviour

- Set the `default_resources_per_workspace_container` in the remote development agent config:

  ```yaml
  remote_development:
    enabled: true
    dns_zone: workspaces.localdev.me
    network_policy:
      enabled: true
      egress:
        - allow: '0.0.0.0/0'
          except:
            - '10.0.0.0/8'
            - '172.16.0.0/12'
            - '192.168.0.0/16'
        - allow: '172.16.123.1/32'
    default_resources_per_workspace_container:
      limits:
        cpu: "1.5"
        memory: "786Mi"
      requests:
        cpu: "0.6"
        memory: "512Mi"
  ```

- This will result in the pod for the existing workspace being terminated and a new pod being created, because the `default_resources_per_workspace_container` values are merged into the workspace's config during reconciliation (see the merge sketch after this list). You can verify this by running `kubectl describe po` and checking that the container's `resources.requests` matches the agent's `default_resources_per_workspace_container.requests`.
- Once the workspace is ready, open the workspace and verify that it contains the `TEST.md` file.
- Thus, any change in the agent's `default_resources_per_workspace_container` results in all workspaces being immediately restarted and the value being enforced, without losing any data in the workspace.
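The defaults are applied with a deep merge in which values the devfile already specifies win over the agent defaults, so only unset fields (the requests, in this walkthrough) are filled in. A sketch of that behaviour, assuming ActiveSupport's `Hash#deep_merge` (not necessarily the exact implementation):

```ruby
require 'active_support/core_ext/hash/deep_merge'

# Agent's default_resources_per_workspace_container (values from above).
agent_defaults = {
  limits:   { cpu: "1.5", memory: "786Mi" },
  requests: { cpu: "0.6", memory: "512Mi" }
}

# Limits already set by the devfile (cpuLimit / memoryLimit);
# the devfile sets no requests.
devfile_resources = { limits: { cpu: "2.3", memory: "2048Mi" } }

# In deep_merge, keys from the argument win: devfile values override the
# defaults, and the defaults only fill in what the devfile left unset.
merged = agent_defaults.deep_merge(devfile_resources)
# => { limits:   { cpu: "2.3", memory: "2048Mi" },
#      requests: { cpu: "0.6", memory: "512Mi" } }
```

This is why `kubectl describe po` shows the container's requests matching the agent defaults while the limits stay at the devfile's values.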
### Verifying `max_resources_per_workspace` behaviour

- Set the `max_resources_per_workspace` in the remote development agent config:

  ```yaml
  remote_development:
    enabled: true
    dns_zone: workspaces.localdev.me
    network_policy:
      enabled: true
      egress:
        - allow: '0.0.0.0/0'
          except:
            - '10.0.0.0/8'
            - '172.16.0.0/12'
            - '192.168.0.0/16'
        - allow: '172.16.123.1/32'
    default_resources_per_workspace_container:
      limits:
        cpu: "1.5"
        memory: "786Mi"
      requests:
        cpu: "0.6"
        memory: "512Mi"
    max_resources_per_workspace:
      limits:
        cpu: "5"
        memory: "5Gi"
      requests:
        cpu: "3"
        memory: "3Gi"
  ```

- This will result in the pod for the existing workspace being terminated and a new pod being created, because `max_resources_per_workspace` has changed and it is used to generate the Kubernetes ResourceQuota during reconciliation. The reason it restarts is that we add an annotation on the workspace pod which is a SHA256 of the agent's `max_resources_per_workspace` value. You can verify this by running `kubectl describe po` and checking the pod's annotations. You can check the generated resource quota by running `kubectl describe resourcequota`.
- Once the workspace is ready, open the workspace and verify that it contains the `TEST.md` file.
- Thus, any change in the agent's `max_resources_per_workspace` results in all workspaces being immediately restarted and the value being enforced, without losing any data in the workspace.
- Update the agent's `max_resources_per_workspace.limits.cpu` to `2` and `max_resources_per_workspace.requests.cpu` to `1.8`.
- This will result in the pod for the existing workspace being terminated but no new pod being created. This is because the workspace's devfile has `cpuLimit` set to `2.3` while the agent's `max_resources_per_workspace.limits.cpu` is now `2`, violating the constraint (see the sketch after this list). You can further verify this by running `kubectl get rs -o yaml` and checking the status of the latest ReplicaSet, which will have a message similar to `message: 'pods "workspace-10-1-1mljkt-7bbcb9698-qp95n" is forbidden: exceeded quota`. Eventually (after 10 minutes), the workspace will have an actual state of `Failed`.
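The rejection follows from simple quota arithmetic: the container's limit exceeds the new maximum, so Kubernetes refuses to admit the replacement pod. A toy illustration of the comparison (the real enforcement is done by the Kubernetes quota admission controller, not by GitLab code):

```ruby
# Values after the update above.
quota_cpu_limit     = 2.0 # max_resources_per_workspace.limits.cpu
workspace_cpu_limit = 2.3 # devfile cpuLimit, which survived the deep merge

if workspace_cpu_limit > quota_cpu_limit
  # Mirrors the ReplicaSet message: pod creation is forbidden, no new pod
  # appears, and the workspace eventually reports an actual state of Failed.
  puts "forbidden: exceeded quota"
end
```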
## MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.