Compress oversized Sidekiq job payload before dispatching into Redis

What does this MR do?

In gitlab-com/gl-infra/scalability#825 (closed), we started looking into limiting the size of Sidekiq job payloads. The first iteration was a Sidekiq client middleware that tracks the occurrences of oversized payloads (!53829 (merged)).

In the current iteration (gitlab-com/gl-infra/scalability#1054 (closed)), we are looking for an actual solution for all the workers with oversized job payloads. We picked the compression approach. In summary:

  • Any job payload exceeding the threshold is compressed using Ruby's built-in Zlib. The original job argument list (job['args']) is replaced by the compressed version; compressed and original_job_size_bytes are added to the job payload to denote the compression and the original size (for observability purposes). See the sketch after this list.
  • Afterwards, the job payload is restored by a server middleware at the top of the server middleware stack and processed as normal.
  • If the job payload after compression exceeds the size limit, the job is discarded.
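
For illustration, here is a minimal sketch of the client-side compression step. The helper name, the 100 KB threshold, and the Base64 wrapping are assumptions for the sketch; the actual compressor class in this MR may differ in details.

require 'zlib'
require 'base64'
require 'json'

# Hypothetical sketch: compress job['args'] in place when the serialized
# payload exceeds the threshold, and record the metadata fields.
COMPRESSION_THRESHOLD_BYTES = 100_000

def compress_if_needed(job)
  original = JSON.generate(job['args'])
  return job if original.bytesize < COMPRESSION_THRESHOLD_BYTES

  job['args'] = [Base64.strict_encode64(Zlib::Deflate.deflate(original))]
  job['compressed'] = true
  job['original_job_size_bytes'] = original.bytesize
  job
end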

This approach has multiple significant advantages over other approaches:

  • This is a Catch 'Em All solution. From gitlab-com/gl-infra/scalability#1054 (closed), we discovered that more than 20 workers have this issue. Fixing all of them manually requires a huge amount of coding and collaboration effort from different stage groups. Even if that operation succeeds, this issue may come back and haunt us in unexpected ways. With this approach, the jobs from all workers are handled automatically and transparently to the layers above.
  • Everything is encapsulated inside the application layer. We don't need administrator intervention or additional external dependencies.
  • The compression rate is good in most cases (gitlab-com/gl-infra/scalability#1054 (comment 568129605))

Of course, it has some trade-offs:

  • The compression rate depends heavily on the shape of the data. JSON and plain text compress best (up to 10x smaller). Already-compressed base64 data compresses worst (up to 3x smaller). Thankfully, all of the oversized job payloads are JSON, except for EmailReceiverWorker's payload; that is a different story because those jobs are pushed directly from mailroom and we can't do anything about them here. Hence, I would say this downside is under control.
  • Compression is CPU-intensive, especially in Ruby. If we put the compression threshold too low, meaning the compression frequency is high, other web/api/git requests in the same process are affected. Based on the analysis in gitlab-com/gl-infra/scalability#1054 (comment 567044321), 100 KB is the right balance. A rough way to sanity-check both the ratio and the CPU cost on a captured payload is sketched below.
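
For example, in a console (illustrative only; the 100 KB threshold above came from the linked analysis, not from this snippet):

require 'zlib'
require 'benchmark'

# Measure ratio and CPU cost for one of the captured payload files
# referenced later in this MR.
payload = File.read('push_hooks_max.json')
compressed = nil
elapsed = Benchmark.realtime { compressed = Zlib::Deflate.deflate(payload) }

puts "original:   #{payload.bytesize} bytes"
puts "compressed: #{compressed.bytesize} bytes"
puts "ratio:      #{(payload.bytesize.to_f / compressed.bytesize).round(1)}x"
puts "time:       #{(elapsed * 1000).round(1)} ms"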

To achieve this, the MR makes the following changes:

  • A new compressor class to compress/decompress and modify the job payload.
  • Add compressed and original_job_size_bytes fields to the structured logs.
  • In the last iteration, I introduced two modes: track and raise. The raise mode is not needed anymore; I replaced it with compress mode. This mode depends on one new environment variable: GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES.
  • A new server middleware. This middleware simply calls the aforementioned compressor to restore the payload (a sketch follows this list).
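
A rough sketch of what the server-side restoration looks like, mirroring the hypothetical client-side sketch above rather than the exact middleware class in this MR:

require 'zlib'
require 'base64'
require 'json'

# Hypothetical server middleware: restore the original args before the rest
# of the server middleware stack and the worker see the job.
class DecompressMiddleware
  def call(_worker, job, _queue)
    if job['compressed']
      job['args'] = JSON.parse(Zlib::Inflate.inflate(Base64.strict_decode64(job['args'].first)))
      job.delete('compressed')
    end

    yield
  end
end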

What happens after this MR is merged?

Right after this MR is merged, we should observe no changes. We already set the track mode on production (gitlab-com/gl-infra/production#4487 (closed)), and jobs should continue to work as before.

The full rollout strategy looks like this:

  • Set the environment variables on Staging, on both VM and K8s. Replicate the following testing scenarios on the UI on Staging and confirm everything works as intended.
  • Set the environment variables on Production. The size limit is set to the maximum observed job payload: 50MB.
  • Gradually lower the size limit to the desired number (5MB?).
  • (Unsure) Replace the environment variables with configuration.

Testing Scenarios

All the following testing scenarios are done by dispatching a job with the desired payload in the Rails console:

  • Start Rails console with testing environment variables
  • Generate the job payload and dispatch the jobs with perform_async
  • Observe the result, from either logs or UI.

Note: in the captured screenshots, the logs are filtered with jq to display only the relevant fields.

Testing with sample Push data

WebHookWorker.perform_async(WebHook.last.id, Gitlab::DataBuilder::Push::SAMPLE_DATA, 'push')

Sample JSON data (1.4kb)
{
  "object_kind":"push",
  "event_name":"push",
  "before":"95790bf891e76fee5e1747ab589903a6a1f80f22",
  "after":"da1560886d4f094c3e6c9ef40349f7d38b5d27d7",
  "ref":"refs/heads/master",
  "checkout_sha":"da1560886d4f094c3e6c9ef40349f7d38b5d27d7",
  "message":"Hello World",
  "user_id":4,
  "user_name":"John Smith",
  "user_email":"john@example.com",
  "user_avatar":"https://s.gravatar.com/avatar/d4c74594d841139328695756648b6bd6?s=8://s.gravatar.com/avatar/d4c74594d841139328695756648b6bd6?s=80",
  "project_id":15,
  "project":{
    "id":15,
    "name":"gitlab",
    "description":"",
    "web_url":"http://test.example.com/gitlab/gitlab",
    "avatar_url":"https://s.gravatar.com/avatar/d4c74594d841139328695756648b6bd6?s=8://s.gravatar.com/avatar/d4c74594d841139328695756648b6bd6?s=80",
    "git_ssh_url":"git@test.example.com:gitlab/gitlab.git",
    "git_http_url":"http://test.example.com/gitlab/gitlab.git",
    "namespace":"gitlab",
    "visibility_level":0,
    "path_with_namespace":"gitlab/gitlab",
    "default_branch":"master"
  },
  "commits":[
    {
      "id":"c5feabde2d8cd023215af4d2ceeb7a64839fc428",
      "message":"Add simple search to projects in public area\n\ncommit message body",
      "title":"Add simple search to projects in public area",
      "timestamp":"2013-05-13T18:18:08+00:00",
      "url":"https://test.example.com/gitlab/gitlab/-/commit/c5feabde2d8cd023215af4d2ceeb7a64839fc428",
      "author":{
        "name":"Test User",
        "email":"test@example.com"
      }
    }
  ],
  "total_commits_count":1,
  "push_options":{
    "ci":{
      "skip":true
    }
  }
}	
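
To double-check the serialized size of this payload from the Rails console before dispatching (Sidekiq.dump_json is the serializer Sidekiq itself uses; the figure below is approximate):

Sidekiq.dump_json(Gitlab::DataBuilder::Push::SAMPLE_DATA).bytesize
# => ~1400 bytes, which is why the scenarios below use limits around 1000-2000 bytes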

Scenario 1: Track mode: job payload size is less than the size limit
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=track
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=2000 

Expected behavior: the testing job is not compressed and dispatched successfully.

Screen_Shot_2021-05-14_at_15.29.42

Scenario 2: Track mode: job payload size is greater than the size limit
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=track 
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=1000 

Expected behavior: the testing job is not compressed and dispatched successfully, but an exception is pushed to Sentry

Screen_Shot_2021-05-14_at_15.29.51

The exception is captured by Sentry:

Screen_Shot_2021-05-14_at_15.27.51

Scenario 3: Compress mode: job payload size is less than the compression limit
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=compress 
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=10000 
GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES=3000

Expected behavior: the testing job is not compressed and dispatched successfully.

Screen_Shot_2021-05-14_at_15.29.51

Scenario 4: Compress mode: job payload size is more than compression threshold but less than the limit
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=compress 
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=1000 
GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES=300

Expected behavior: the testing job payload is compressed then processed successfully

Screen_Shot_2021-05-14_at_15.15.31

The webhook log is shown on the UI:

Screen_Shot_2021-05-14_at_15.24.42
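
Besides the logs, the compressed payload can also be inspected straight from the queue in a console. A rough example, assuming the Sidekiq process is paused so the job has not been picked up yet; the queue name 'web_hook' is an assumption for illustration:

require 'sidekiq/api'

# Peek at the queued job and its compression metadata.
job = Sidekiq::Queue.new('web_hook').first
job['compressed']              # => true
job['original_job_size_bytes'] # => size of the payload before compression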

Scenario 5: Compress mode: job payload size is more than the compression threshold and still more than the limit after compression
GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=compress 
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=500 
GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES=300

Expected behavior: the testing job is not dispatched, and an exception is raised

Screen_Shot_2021-05-14_at_15.20.21
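
From the caller's point of view, the dispatch simply fails. A minimal illustration; large_args stands for any payload that still exceeds the limit after compression, and the rescue is deliberately generic rather than naming the exact exception class:

begin
  WebHookWorker.perform_async(WebHook.last.id, large_args, 'push')
rescue => e
  # The size limiter raises before the job reaches Redis when the compressed
  # payload still exceeds GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES.
  Rails.logger.warn(e.message)
end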

Testing with real big payload data

Testing with sample data is not enough. In gitlab-com/gl-infra/scalability#1054 (comment 567592064), I collected a series of real-world oversized job payloads. The following scenarios are tested with the following environment variables, which are also the ones that should be set on production:

GITLAB_SIDEKIQ_SIZE_LIMITER_MODE=compress 
GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES=10000000
GITLAB_SIDEKIQ_SIZE_LIMITER_COMPRESSION_THRESHOLD_BYTES=100000

Scenario 6: Push hook

Payload: 24MB push_hooks_max.json

WebHookWorker.perform_async(WebHook.last.id, JSON.parse(File.read('push_hooks_max.json')), 'push')

Expected behavior: the job payload is compressed, and dispatched successfully

Screen_Shot_2021-05-14_at_15.51.47

The webhook log is shown on the UI:

Screen_Shot_2021-05-14_at_15.52.34

Scenario 7: Pipeline hook

Payload: 2MB pipeline_hooks_99th.json

WebHookWorker.perform_async(WebHook.last.id, JSON.parse(File.read('pipeline_hooks_99th.json')), 'pipeline')

Expected behavior: the job payload is compressed, and dispatched successfully

Screen_Shot_2021-05-14_at_15.53.36

The webhook log is shown on the UI:

Screen_Shot_2021-05-14_at_15.53.59

Scenario 8: New note hook

Payload: 4MB new_notes_max.json

WebHookWorker.perform_async(WebHook.last.id, JSON.parse(File.read('new_notes_max.json')), 'new_notes')

Expected behavior: the job payload is compressed, and dispatched successfully

Screen_Shot_2021-05-14_at_15.54.31

The webhook log is shown on the UI:

Screen_Shot_2021-05-14_at_15.55.30

Scenario 9: Merge request hook

Payload: 47MB merge_request_hooks_max.json

WebHookWorker.perform_async(WebHook.last.id, JSON.parse(File.read('merge_request_hooks_max.json')), 'merge_request')

Expected behavior: the job is not dispatched and an exception is raised. After compression, the payload size still slightly exceeds 10MB (down from 47MB); that is a good compression ratio, but not enough to fit under the limit.

Screen_Shot_2021-05-14_at_15.55.55

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team