Enable de-duplication of the ElasticCommitIndexerWorker jobs (!31500)

Micaël Bergeron requested to merge 205178-change-repository-indexing-to-sorted-sets-algorithm into master May 08, 2020

What does this MR do?

This MR changes how the commit and blob indexer hands work off to the underlying worker.

Before this change, each indexing event (push, commit, …) would enqueue a job with the range of commits to index.

To improve the handling of jobs, this MR defers the selection of the commit range until the job runs, so that indexing always runs for LAST_INDEXED_COMMIT..HEAD.

With that change, every job for a given project is interchangeable, so we can now mark the worker as idempotent and de-duplicate redundant jobs.
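The idea can be illustrated with a minimal, self-contained sketch. It is not the code in this MR: `DedupQueue`, `FakeProject` and `run_index_job` are hypothetical names, the queue is a plain in-memory stand-in for the real job queue, and "last indexed commit" is modelled as a simple counter. It only shows why jobs that carry no commit range can be collapsed into one.

```ruby
require "set"

# Hypothetical in-memory stand-in for a job queue that drops duplicates.
class DedupQueue
  def initialize
    @pending = []   # jobs waiting to run, in arrival order
    @keys = Set.new # identity keys (worker + args) of the pending jobs
  end

  # Enqueue only if no identical job (same worker, same args) is pending.
  def push(worker, *args)
    key = [worker, args]
    return false unless @keys.add?(key) # add? returns nil when already present
    @pending << key
    true
  end

  def size
    @pending.size
  end

  # Run every pending job, yielding the worker name and its arguments.
  def drain
    until @pending.empty?
      worker, args = @pending.shift
      @keys.delete([worker, args])
      yield worker, args
    end
  end
end

# Hypothetical project state: the commit log and how much of it is indexed.
class FakeProject
  attr_reader :commits
  attr_accessor :last_indexed

  def initialize
    @commits = []
    @last_indexed = 0 # number of commits already indexed
  end
end

# The job receives only the project; the range is resolved when it runs,
# which mirrors LAST_INDEXED_COMMIT..HEAD.
def run_index_job(project)
  from = project.last_indexed
  to = project.commits.size
  range = project.commits[from...to]
  project.last_indexed = to
  range
end

projects = { 1 => FakeProject.new }
queue = DedupQueue.new

# Three pushes arrive before any job runs; each tries to enqueue the same job.
%w[a b c].each do |sha|
  projects[1].commits << sha
  queue.push(:index_commits, 1)
end
pending_before_run = queue.size

indexed = []
queue.drain { |_worker, (project_id)| indexed << run_index_job(projects[project_id]) }
```

Because the job arguments no longer include a commit range, the three enqueue attempts share one identity and a single run covers all three commits. With the previous design each job carried a different range, so no two jobs were equal and none could be dropped safely.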

Future iterations

Use a git ref instead of the `index_status`

Instead of writing the index status to the database, could we create a git ref in the repository that records the indexing status?

That would prevent having to query the database and would enable the full processing to happen in the gitlab-elasticsearch-indexer.

# create the ref
git update-ref refs/elasticsearch/master $(git rev-parse master)

# then you can use it as a Git object
git diff refs/elasticsearch/master..master

Buffer the index updates

We should implement the same logic as Gitlab::Elastic::BulkIndexer, so that the Elasticsearch updates are batched into a buffer that gets flushed periodically.
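A rough sketch of that buffering pattern follows. It does not reproduce the interface of Gitlab::Elastic::BulkIndexer: `BufferedIndexer`, its `max_size:` parameter and the sender block are hypothetical, and "periodically" is approximated here by flushing whenever the buffer reaches a size threshold, plus one explicit flush at the end.

```ruby
# Hypothetical buffer that batches index operations before sending them.
class BufferedIndexer
  # max_size: number of buffered operations that triggers a flush (assumed).
  # sender:   block that receives each batch, e.g. one bulk request.
  def initialize(max_size:, &sender)
    @max_size = max_size
    @sender = sender
    @buffer = []
  end

  # Queue one operation and flush once the buffer is full.
  def process(operation)
    @buffer << operation
    flush if @buffer.size >= @max_size
  end

  # Send whatever is buffered as a single batch; no-op when empty.
  def flush
    return if @buffer.empty?

    batch = @buffer
    @buffer = []
    @sender.call(batch)
  end
end

sent = []
indexer = BufferedIndexer.new(max_size: 2) { |batch| sent << batch }

%w[doc1 doc2 doc3].each { |doc| indexer.process(doc) }
indexer.flush # send the remainder that never reached the threshold
```

Three operations end up in two batches instead of three separate requests. A real implementation would probably also bound the buffer by payload size and handle partial failures in the bulk response, which this sketch leaves out.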

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.

Related to #205178

Edited May 31, 2022 by 🤖 GitLab Bot 🤖
