Skip to content

Ensure primary key index is used for Geo::Replicable

Stan Hu requested to merge sh-geo-use-cte-for-local-files into master

What does this MR do and why?

In gitlab-com/gl-infra/production#7756 (closed), we found that for a large ci_job_artifacts table, the query planner can flip between using one of two indexes to find local files:

  1. ci_job_artifacts_pkey
  2. index_ci_job_artifacts_on_file_store

The first index is always faster because for a given primary ID, there is only one row and hence filtering by file_store = 1 is fast.

In the worst case on GitLab.com, the second index would have to sift through over 22,000 rows to find a matching artifact with the given primary ID.

Even if Geo is not enabled, save_verification_details is called with every model that is replicated hooks into save. To ensure that the query is always fast, use a materialized CTE to look up by primary key ID and filter from there.

Resolves #374701 (closed)

Resolves #374718 (closed)

SQL Queries

Before

When the query planner and statistics are in a good state, this query works just fine.

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/12244/commands/43526 shows:

SELECT 1 AS one FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."file_store" = 1 AND "ci_job_artifacts"."id" = 10 LIMIT 1
 Limit  (cost=0.58..3.60 rows=1 width=4) (actual time=0.088..0.088 rows=0 loops=1)
   Buffers: shared hit=7
   I/O Timings: read=0.000 write=0.000
   ->  Index Scan using ci_job_artifacts_pkey on public.ci_job_artifacts  (cost=0.58..3.60 rows=1 width=4) (actual time=0.086..0.086 rows=0 loops=1)
         Index Cond: (ci_job_artifacts.id = 10)
         Filter: (ci_job_artifacts.file_store = 1)
         Rows Removed by Filter: 0
         Buffers: shared hit=7
         I/O Timings: read=0.000 write=0.000
Time: 0.302 ms
  - planning: 0.196 ms
  - execution: 0.106 ms
    - I/O read: 0.000 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 7 (~56.00 KiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

When things go bad!

For a bad case, see gitlab-com/gl-infra/production#7756 (comment 1104365277):

gitlabhq_production=# explain ( analyze, buffers ) SELECT 1 AS one FROM "ci_job_artifacts" WHERE "ci_job_artifacts"."file_store" = 1 AND "ci_job_artifacts"."id" = 3314157561 LIMIT 1 ;
                                                                          QUERY PLAN                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.58..2.10 rows=1 width=4) (actual time=30.011..30.012 rows=0 loops=1)
   Buffers: shared hit=18243
   ->  Index Scan using index_ci_job_artifacts_on_file_store on ci_job_artifacts  (cost=0.58..2.10 rows=1 width=4) (actual time=30.010..30.010 rows=0 loops=1)
         Index Cond: (file_store = 1)
         Filter: (id = '3314157561'::bigint)
         Rows Removed by Filter: 22630
         Buffers: shared hit=18243
 Planning Time: 0.128 ms
 Execution Time: 30.028 ms
(9 rows)

18,243 buffers were read. This by itself isn't terrible, but since CI job traces get saved over 15,000 times a minute (https://log.gprd.gitlab.net/goto/75ee0f90-3ab9-11ed-8656-f5f2137823ba), this can add up!

After

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/12244/commands/43528:

WITH "verifiables" AS MATERIALIZED
  (SELECT "ci_job_artifacts"."project_id", "ci_job_artifacts"."file_type", "ci_job_artifacts"."size", "ci_job_artifacts"."created_at", "ci_job_artifacts"."updated_at", "ci_job_artifacts"."expire_at", "ci_job_artifacts"."file", "ci_job_artifacts"."file_store", "ci_job_artifacts"."file_sha256", "ci_job_artifacts"."file_format", "ci_job_artifacts"."file_location", "ci_job_artifacts"."id", "ci_job_artifacts"."job_id", "ci_job_artifacts"."locked", "ci_job_artifacts"."partition_id"
   FROM "ci_job_artifacts"
   WHERE "ci_job_artifacts"."id" = 10)
SELECT 1 AS one
FROM "verifiables" AS "ci_job_artifacts"
WHERE "ci_job_artifacts"."file_store" = 1
LIMIT 1
 Limit  (cost=3.59..3.62 rows=1 width=4) (actual time=0.055..0.056 rows=0 loops=1)
   Buffers: shared hit=7
   I/O Timings: read=0.000 write=0.000
   CTE verifiables
     ->  Index Scan using ci_job_artifacts_pkey on public.ci_job_artifacts ci_job_artifacts_1  (cost=0.58..3.59 rows=1 width=147) (actual time=0.052..0.053 rows=0 loops=1)
           Index Cond: (ci_job_artifacts_1.id = 10)
           Buffers: shared hit=7
           I/O Timings: read=0.000 write=0.000
   ->  CTE Scan on verifiables ci_job_artifacts  (cost=0.00..0.02 rows=1 width=4) (actual time=0.054..0.054 rows=0 loops=1)
         Filter: (ci_job_artifacts.file_store = 1)
         Rows Removed by Filter: 0
         Buffers: shared hit=7
         I/O Timings: read=0.000 write=0.000
Time: 0.894 ms
  - planning: 0.782 ms
  - execution: 0.112 ms
    - I/O read: 0.000 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 7 (~56.00 KiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Michael Kozono

Merge request reports

Loading