Compress oversized Sidekiq job payload before dispatching into Redis
What does this MR do?
In gitlab-com/gl-infra/scalability#825 (closed), we started discussing how to limit the size of the Sidekiq job payload. The first iteration implemented a Sidekiq client middleware that tracks occurrences of oversized payloads (!53829 (merged)).
In the current iteration (gitlab-com/gl-infra/scalability#1054 (closed)), we are looking for an actual solution to resolve all the workers with oversized job payloads. We eventually picked the compression approach. In summary:

- Any job payload exceeding the threshold is compressed into Gzip format using Ruby's built-in Zlib. The original job argument list (`job['args']`) is replaced by the compressed version; `compressed` and `original_job_size_bytes` fields are added to the job payload to denote the compression and the original size (for observability purposes).
- Afterward, the job payload is restored by a server middleware at the top of the server middleware stack and processed as normal.
- If the job payload still exceeds the size limit after compression, the job is discarded.
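The client-side compression step described above can be sketched roughly as follows. All names here are illustrative (a `compressed` flag, an `original_job_size_bytes` field, Base64-encoded gzip output); the actual implementation in the MR may differ:

```ruby
require 'json'
require 'zlib'
require 'base64'

# Rough sketch of the client-side compression step (illustrative only).
# Compresses job['args'] when its serialized size exceeds the threshold.
def compress_args!(job, threshold_bytes)
  serialized = JSON.generate(job['args'])
  return job if serialized.bytesize < threshold_bytes

  job['original_job_size_bytes'] = serialized.bytesize
  job['compressed'] = true
  # Replace the argument list with a single Base64-encoded gzip blob.
  job['args'] = [Base64.strict_encode64(Zlib.gzip(serialized))]
  job
end
```

Encoding the gzip bytes with Base64 keeps the stored arguments valid JSON strings, at the cost of roughly a third more space than raw bytes.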
This approach has multiple significant advantages over other approaches:
- This is a Catch 'Em All solution. From gitlab-com/gl-infra/scalability#1054 (closed), we discovered that over 20 workers have this issue. Fixing all of them manually would require a huge amount of coding and collaboration effort from different stage groups. Even if that operation succeeded, this issue could come back and haunt us in unexpected ways. With this approach, jobs from all workers are handled automatically and transparently to the layers above.
- Everything is encapsulated inside the application layer. We don't need administrator intervention or additional external dependencies.
- The compression rate is good in most cases (gitlab-com/gl-infra/scalability#1054 (comment 568129605))
Of course, it has some trade offs:
- The compression rate depends heavily on the shape of the data. JSON and plain text compress best (up to 10x smaller). Already-compressed Base64 data compresses least (up to 3x smaller). Thankfully, all of the oversized job payloads are JSON, except for EmailReceiverWorker's payload. That is a different story because those jobs are pushed directly from mailroom and we can't do anything about that. Hence, I'd say this downside is under control.
- Compression is CPU-intensive, especially in Ruby. If we set the compression threshold too low, the compression frequency is high and other web/api/git requests in the same process are affected. Based on my analysis of the data in gitlab-com/gl-infra/scalability#1054 (comment 567044321), 100 KB is the right balance.
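The compression-rate difference between data shapes is easy to sanity-check locally. A minimal sketch (the sample data here is made up; it is not from the linked analysis):

```ruby
require 'json'
require 'zlib'
require 'base64'

# Compare gzip's effect on repetitive JSON vs. Base64 of random bytes.
json_like = JSON.generate(
  { 'events' => Array.new(200) { { 'name' => 'push', 'ref' => 'refs/heads/master' } } }
)
base64_blob = Base64.strict_encode64(Random.bytes(10_000))

json_ratio = json_like.bytesize.to_f / Zlib.gzip(json_like).bytesize
blob_ratio = base64_blob.bytesize.to_f / Zlib.gzip(base64_blob).bytesize

puts format('JSON: %.1fx, Base64 blob: %.1fx', json_ratio, blob_ratio)
```

Highly repetitive JSON shrinks by an order of magnitude, while Base64-encoded random bytes barely shrink at all, matching the trade-off described above.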
To achieve the above, this MR implements the following changes:

- A new compressor class to compress/decompress and modify the job payload.
- Add `compressed` and `original_job_size_bytes` fields to the structured logs.
- In the last iteration, I introduced two modes: track and raise. The raise mode is not needed anymore. I replaced it with a `compress` mode. This mode depends on one new environment variable: `GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES`.
- A new server middleware. This middleware simply calls the aforementioned compressor.
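The server-side restore step might look roughly like this. The class and field names are hypothetical; the real middleware in the MR is more involved:

```ruby
require 'json'
require 'zlib'
require 'base64'

# Illustrative server middleware: restores job['args'] compressed by the
# client middleware before the job handler runs.
class DecompressJobMiddleware
  def call(_worker, job, _queue)
    if job['compressed']
      raw = Base64.strict_decode64(job['args'].first)
      job['args'] = JSON.parse(Zlib.gunzip(raw))
      job.delete('compressed')
    end
    yield
  end
end
```

Because this runs at the top of the server middleware stack, every middleware below it (and the worker itself) sees the original, uncompressed arguments.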
What happens after this MR is merged?
Right after this MR is merged, we expect no changes in behavior. We already enabled track mode on production (gitlab-com/gl-infra/production#4487 (closed)), so the workers should continue to work as before.
The full rollout strategy looks like this:
- Set the environment variables on Staging, on both VM and K8s. Replicate the following testing scenarios on the UI on Staging and confirm it works as intended.
- Set the environment variables on Production. The size limit is set to the maximum observed job payload: 50MB.
- Gradually lower the size limit to the desired number (5MB?)
- (Unsure) replace the environment variables by configuration.
Testing Scenarios
All the following testing scenarios are done by dispatching a job with the desired payload in a Rails console:

- Start a Rails console with the testing environment variables.
- Generate the job payload and dispatch the job with `perform_async`.
- Observe the result, from either the logs or the UI.

Note: in the captured screenshots, the logs are filtered with `jq` to display the related fields only.
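Before dispatching, the serialized size of a candidate payload can be estimated from the console. The helper name below is made up for illustration; it relies only on the fact that Sidekiq serializes job arguments to a JSON array before pushing to Redis:

```ruby
require 'json'

# Approximate the Sidekiq job argument size: arguments are serialized
# as a JSON array before being pushed to Redis.
def payload_size_bytes(*args)
  JSON.generate(args).bytesize
end

payload_size_bytes(123, { 'data' => 'x' * 2_000 }, 'push')
```

This makes it easy to craft payloads that sit just above or below a given `GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES` value for the scenarios below.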
Testing with sample Push data

```ruby
WebHookWorker.perform_async(WebHook.last.id, Gitlab::DataBuilder::Push::SAMPLE_DATA, 'push')
```
Sample JSON data (1.4 KB):

```json
{
  "object_kind":"push",
  "event_name":"push",
  "before":"95790bf891e76fee5e1747ab589903a6a1f80f22",
  "after":"da1560886d4f094c3e6c9ef40349f7d38b5d27d7",
  "ref":"refs/heads/master",
  "checkout_sha":"da1560886d4f094c3e6c9ef40349f7d38b5d27d7",
  "message":"Hello World",
  "user_id":4,
  "user_name":"John Smith",
  "user_email":"john@example.com",
  "user_avatar":"https://s.gravatar.com/avatar/d4c74594d841139328695756648b6bd6?s=80",
  "project_id":15,
  "project":{
    "id":15,
    "name":"gitlab",
    "description":"",
    "web_url":"http://test.example.com/gitlab/gitlab",
    "avatar_url":"https://s.gravatar.com/avatar/d4c74594d841139328695756648b6bd6?s=80",
    "git_ssh_url":"git@test.example.com:gitlab/gitlab.git",
    "git_http_url":"http://test.example.com/gitlab/gitlab.git",
    "namespace":"gitlab",
    "visibility_level":0,
    "path_with_namespace":"gitlab/gitlab",
    "default_branch":"master"
  },
  "commits":[
    {
      "id":"c5feabde2d8cd023215af4d2ceeb7a64839fc428",
      "message":"Add simple search to projects in public area\n\ncommit message body",
      "title":"Add simple search to projects in public area",
      "timestamp":"2013-05-13T18:18:08+00:00",
      "url":"https://test.example.com/gitlab/gitlab/-/commit/c5feabde2d8cd023215af4d2ceeb7a64839fc428",
      "author":{
        "name":"Test User",
        "email":"test@example.com"
      }
    }
  ],
  "total_commits_count":1,
  "push_options":{
    "ci":{
      "skip":true
    }
  }
}
```
Scenario 2: Track mode: job payload size is greater than the size limit

```
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=track
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=1000
```

Expected behavior: the testing job is not compressed and is dispatched successfully, but an exception is pushed to Sentry.

Scenario 3: Compress mode: job payload size is less than the compression threshold

```
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=compress
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=10000
GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES=3000
```

Expected behavior: the testing job is not compressed and is dispatched successfully.

Scenario 4: Compress mode: job payload size is more than the compression threshold but less than the limit

```
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=compress
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=1000
GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES=300
```

Expected behavior: the testing job payload is compressed, then processed successfully. The webhook log is shown on the UI.

Scenario 5: Compress mode: job payload size is more than the compression threshold and still more than the limit after compression

```
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=compress
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=500
GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES=300
```

Expected behavior: the testing job is not dispatched, and an exception is raised.
Testing with real, large payload data

Testing with sample data is not enough. In gitlab-com/gl-infra/scalability#1054 (comment 567592064), I collected a series of real-world oversized job payloads. The following scenarios are tested with these environment variables, which are also the ones that should be set on production:

```
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=compress
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=10000000
GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES=100000
```

Scenario 6: Push hook

```ruby
WebHookWorker.perform_async(WebHook.last.id, JSON.parse(File.read('push_hooks_max.json')), 'push')
```

Payload: 24MB push_hooks_max.json
Expected behavior: the job payload is compressed and dispatched successfully. The webhook log is shown on the UI.

Scenario 7: Pipeline hook

```ruby
WebHookWorker.perform_async(WebHook.last.id, JSON.parse(File.read('pipeline_hooks_99th.json')), 'pipeline')
```

Payload: 2MB pipeline_hooks_99th.json
Expected behavior: the job payload is compressed and dispatched successfully. The webhook log is shown on the UI.

Scenario 8: New note hook

```ruby
WebHookWorker.perform_async(WebHook.last.id, JSON.parse(File.read('new_notes_max.json')), 'new_notes')
```

Payload: 4MB new_notes_max.json
Expected behavior: the job payload is compressed and dispatched successfully. The webhook log is shown on the UI.

Scenario 9: Merge request hook

```ruby
WebHookWorker.perform_async(WebHook.last.id, JSON.parse(File.read('merge_request_hooks_max.json')), 'merge_request')
```

Payload: 47MB merge_request_hooks_max.json
Expected behavior: the job is not dispatched, and an exception is raised. After compression, the payload size slightly surpasses 10MB (down from 47MB). It's a good compression ratio, but still not enough to fit under the limit.
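The Scenario 9 outcome (compressed payload still over the limit, so the job is discarded) can be reproduced in miniature with incompressible data. The variable names are illustrative, not from the MR:

```ruby
require 'zlib'

limit_bytes = 500
# Random bytes are essentially incompressible, so gzip cannot get this
# payload under the limit; a job with such a payload would be discarded.
payload = Random.bytes(10_000)
compressed = Zlib.gzip(payload)
over_limit = compressed.bytesize > limit_bytes
```

This is why the size limit must leave headroom above the realistic post-compression sizes observed in production, rather than being derived from raw payload sizes alone.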
Does this MR meet the acceptance criteria?
Conformity
- [ ] I have included a changelog entry, or it's not needed. (Does this MR need a changelog?)
- [ ] I have added/updated documentation, or it's not needed. (Is documentation required?)
- [ ] I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
- [ ] I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
- [ ] I have self-reviewed this MR per code review guidelines.
- [ ] This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
- [ ] I have followed the style guides.
Availability and Testing
- [ ] I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
- [ ] I have tested this MR in all supported browsers, or it's not needed.
- [ ] I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
Security
Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.
- [ ] Label as security and @ mention @gitlab-com/gl-security/appsec
- [ ] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [ ] Security reports checked/validated by a reviewer from the AppSec team