GitHub importer: Skip queuing attachment workers for objects with no attachments
About
The GitHub importer's attachment stage currently loops through all imported note, MR, issue and release objects and queues a worker per object.
Each worker then looks at the note or description text for attachments. If there are no attachments, the worker does nothing.
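The current flow can be sketched roughly as follows. This is an illustrative, hedged sketch — the worker class and the simple substring check are assumptions for demonstration, not the importer's real code:

```ruby
# Hedged sketch of the current behaviour; names and the substring check are
# illustrative, not the importer's actual implementation.
notes = [
  "a plain comment with no attachment",
  "![img](https://user-images.githubusercontent.com/1/uuid.png)"
]

# The stage queues one worker per object, unconditionally:
queued = notes.map { |text| text } # e.g. AttachmentsDownloadWorker.perform_async(...)

# Each worker then scans the text itself and no-ops when nothing matches:
def worker_perform(text)
  return :noop unless text.include?("user-images.githubusercontent.com")
  :download
end

results = queued.map { |text| worker_perform(text) }
```

The key point is that the "is there anything to do?" check happens inside the worker, after it has already been queued and spread.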
Problem
When we queue a worker, we spread them to run 1,000 per minute. The more workers queued, the longer the spread across time.
Because most workers in the attachments stage will do nothing, this spread adds unnecessary time to an import. The larger the import, the more time is wasted in this stage.
To take the large kubernetes import as an example (https://gitlab.com/gitlab-org/manage/import-and-integrate/discussions/-/issues/61#note_1695770492), focusing only on importing the note attachments (ignoring the MR, issue and release description attachments):
- There were 1,761,570 workers queued (one per note).
- Spread at 1,000 per minute, this means a spread over 29.3 hours (note: after the spreading fix #432137 (comment 1701524385)).
Suppose 1 in 500 notes has attachments. If we only queued a worker per note with an attachment:
- There would be 3,523 workers queued.
- The spread would be across 3.5 minutes.
Even if 1 in 50 notes had attachments, the spread would only be across 35 minutes.
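The arithmetic above can be sanity-checked with a quick sketch. The 1,000-per-minute rate and the note count come from the numbers in this issue; the helper name is illustrative:

```ruby
# Back-of-the-envelope check of the spread times, assuming a flat queue
# rate of 1,000 workers per minute (helper name is illustrative).
def spread_minutes(worker_count, per_minute: 1_000)
  worker_count.to_f / per_minute
end

all_notes = 1_761_570

hours_all        = spread_minutes(all_notes) / 60         # ≈ 29.3-29.4 hours for every note
minutes_1_in_500 = spread_minutes(all_notes / 500)        # ≈ 3.5 minutes
minutes_1_in_50  = spread_minutes(all_notes / 50)         # ≈ 35 minutes
```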
In both cases the spread would be shorter than the duration of the stage worker itself, so the waiting in this stage would come down to just that duration. In the kubernetes import this stage worker took 6 hours.
Proposal
Have the attachment stage importer scan for attachments in text and only queue workers for objects that have attachments.
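A minimal sketch of the proposed filtering, assuming a regex on GitHub's attachment URL host — the pattern, `Note` struct and worker name are illustrative assumptions, not the importer's actual implementation:

```ruby
# Hedged sketch of the proposal: scan first, queue only for matches.
# The regex and worker name are illustrative assumptions.
ATTACHMENT_URL = %r{https://user-images\.githubusercontent\.com/}

Note = Struct.new(:id, :text)

notes = [
  Note.new(1, "a plain comment with nothing to download"),
  Note.new(2, "![img](https://user-images.githubusercontent.com/1/uuid.png)")
]

with_attachments = notes.select { |note| note.text.match?(ATTACHMENT_URL) }
# Only these objects would get a worker queued, e.g.:
# with_attachments.each { |note| AttachmentsDownloadWorker.perform_async(note.id) }
```

The scan moves into the stage worker's loop, so the per-object workers are only queued when there is actual work to do.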
Opportunity
So to take kubernetes as an example, the attachments stage would take:
- Before: waiting 29.3 hours
- After: waiting 6 hours (the duration of the stage worker to complete)
Note: A benchmark to get an indication of the increased work done by the stage worker to scan for attachments showed the extra time spent would be negligible.
Click to see benchmark
# Run on my development rails console.
require 'benchmark/ips'
TEXT_WITH_ATTACHMENTS = (
  ["![special-image](https://user-images.githubusercontent.com/1/uuid-1.png)"] +
  FFaker::Lorem.paragraphs +
  ["![special-image](https://user-images.githubusercontent.com/1/uuid-2.png)"] +
  FFaker::Lorem.paragraphs
).join("\n")
TEXT_WITH_NO_ATTACHMENTS = (FFaker::Lorem.paragraphs + FFaker::Lorem.paragraphs).join("\n")

NOTE_WITH_ATTACHMENTS = Note.new(note: TEXT_WITH_ATTACHMENTS)
NOTE_WITH_NO_ATTACHMENTS = Note.new(note: TEXT_WITH_NO_ATTACHMENTS)

Benchmark.ips do |x|
  x.report("test with attachments") do
    Gitlab::GithubImport::Representation::NoteText.from_db_record(NOTE_WITH_ATTACHMENTS).has_attachments?
  end

  x.report("test with no attachments") do
    Gitlab::GithubImport::Representation::NoteText.from_db_record(NOTE_WITH_NO_ATTACHMENTS).has_attachments?
  end
end

# Result: test with attachments:    193.609k iterations in 5.073431s
# Result: test with no attachments: 453.135k iterations in 5.082058s
#
# The extra work time is negligible.
Second note: if the spread were increased to >= 5,000 workers per minute there would be no speed benefit from this change. If it were increased to 3,000 per minute there would still be a speed benefit of roughly 3.5 hours.
Risks
It involves the attachments stage worker scanning text for attachments (scanning only, not downloading them), which will increase the memory and CPU consumption of that stage worker. In a very large import we would hope that memory is freed as the scan progresses. Either way, this change should be behind a feature flag.