Quickly resolve issues with your Cleanup policy with improved validation and notifications
Problem to solve
Administrators would like to use the GitLab Cleanup policy for tags at the Project level so that they can programmatically identify which tags should be removed or retained. To do so, they define an interval and a schedule, and use regular expressions to specify which tag names to remove or retain.
When an expiration policy has failed due to an invalid regular expression, you need to be notified, so that you can fix the issue as quickly as possible.
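To illustrate the failure mode, here is a minimal sketch of validating a tag-name regular expression before a policy is saved or run. It uses Python's `re` module purely as a stand-in; the regex engine and syntax that the Cleanup policy actually accepts may differ, and `validate_tag_regex` is a hypothetical helper, not an existing GitLab function.

```python
import re
from typing import Optional


def validate_tag_regex(pattern: str) -> Optional[str]:
    """Return None when the pattern compiles, otherwise a readable error message.

    Hypothetical helper: Python's `re` is only a stand-in for the real engine.
    """
    try:
        re.compile(pattern)
    except re.error as exc:
        return f"Invalid regular expression: {exc}"
    return None


if __name__ == "__main__":
    # A well-formed pattern yields no error; an unbalanced group yields a message.
    for candidate in ["v.+", "(unclosed"]:
        print(repr(candidate), "->", validate_tag_regex(candidate))
```

Returning the message instead of raising keeps the check usable both at save time (to block the form) and at run time (to record why the policy failed).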
Intended users
- Delaney (Development Team Lead)
- Sasha (Software Developer)
- Devon (DevOps Engineer)
- Sidney (Systems Administrator)
Proposal
When a Project's Cleanup Policy has failed to run, we will notify the Project's Owner/Admin with a helpful error message via email and the UI.
Email notification
Subject: Cleanup policy has failed for project
Body:
- Project
- Policy
- Error
- Link to documentation on acceptable regex
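As a rough sketch of the notification above, the message could be assembled like this. The subject line and the four body fields come from this proposal; the function name, the field layout, and the wording of each line are assumptions, and the final copy is still an open UX question.

```python
from email.message import EmailMessage


def build_failure_email(project: str, policy: str, error: str, docs_url: str) -> EmailMessage:
    """Assemble the failure notification (hypothetical helper, illustrative layout)."""
    message = EmailMessage()
    message["Subject"] = "Cleanup policy has failed for project"
    # One line per field listed in the proposal's email body.
    message.set_content(
        f"Project: {project}\n"
        f"Policy: {policy}\n"
        f"Error: {error}\n"
        f"Documentation on acceptable regex: {docs_url}\n"
    )
    return message


if __name__ == "__main__":
    # The project path and error text are made-up sample values.
    email = build_failure_email(
        project="my-group/my-project",
        policy="Tag cleanup policy",
        error="Invalid regular expression: unbalanced parenthesis",
        docs_url="https://docs.gitlab.com/ee/user/packages/container_registry/#expiration-policy",
    )
    print(email["Subject"])
    print(email.get_content())
```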
User experience
The UI shows an alert message and highlights the affected field with a specific error message when the user lands on their Project's CI/CD settings page (where the Tag Cleanup Policy is located).
User experience goal
- When a policy fails to run due to a regular expression issue, the user is notified that the job failed, why, and how to fix it.
UX Questions
- Who should see these errors?
- @icamacho I think we should only show the error to those that have the power to fix it. I believe this is Project Owner/Admin.
- What copy should we use for the email/UI?
Further details
Technical considerations
The execution error happens in a worker, which runs in the background and is not connected to the UI. What we could do is have the worker save the error message on the container expiration policy, so that the error message is displayed when the user visits the UI.
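The hand-off described above can be sketched as follows. This is a simplified Python model under stated assumptions, not the actual worker: `CleanupPolicy`, its `last_error` field, and both function names are hypothetical, and Python's `re` stands in for the real regex engine.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CleanupPolicy:
    # Hypothetical stand-in for the stored container expiration policy.
    name_regex: str
    last_error: Optional[str] = None


def run_cleanup_worker(policy: CleanupPolicy, tags: List[str]) -> List[str]:
    """Background step: return the tags matching the policy, recording any failure."""
    try:
        pattern = re.compile(policy.name_regex)
    except re.error as exc:
        # Persist the failure so the UI can show it on the next visit.
        policy.last_error = f"Invalid regular expression: {exc}"
        return []
    policy.last_error = None  # A successful run clears any stale error.
    return [tag for tag in tags if pattern.fullmatch(tag)]


def settings_page_alert(policy: CleanupPolicy) -> Optional[str]:
    """UI step: return the alert text to display, or None if the last run succeeded."""
    if policy.last_error is None:
        return None
    return f"Cleanup policy failed to run. {policy.last_error}"
```

Because the worker only writes to the policy record and the settings page only reads from it, neither side needs a live connection to the other. Clearing the stored error on a successful run keeps the alert from lingering after the regex has been fixed.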
Permissions and Security
- No permission changes are required for this change.
Documentation
- https://docs.gitlab.com/ee/user/packages/container_registry/#expiration-policy
- https://docs.gitlab.com/ee/api/projects.html
Availability & Testing
What does success look like, and how can we measure that?
Success looks like a higher ratio of policy runs that succeed to runs that fail, and Admins being able to rely on the feature to work for their project.
Metrics
- We will measure this by looking at the overall adoption of the feature
- @10io is it possible to track the number of jobs that succeeded/failed?
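If succeeded and failed job counts can be tracked, the success ratio mentioned above could be computed along these lines. This is only a sketch: the function name is hypothetical, and the choice to report 0.0 when no jobs have run is an assumption.

```python
def success_ratio(succeeded: int, failed: int) -> float:
    """Share of policy runs that succeeded, as a value between 0.0 and 1.0."""
    total = succeeded + failed
    if total == 0:
        # Assumption: report 0.0 rather than raising when no jobs have run yet.
        return 0.0
    return succeeded / total
```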