Skip to content

Upsert missing or inactive services records for projects with available Prometheus application installed on shared k8s cluster

Queries

Plans are retrieved from staging.gitlab.com due to lack of access to production:

/app/lib/chatops/database/read_only_connection.rb:42:in `async_exec': ERROR:  cannot execute SELECT in a read-only transaction (PG::ReadOnlySqlTransaction)

https://ops.gitlab.net/gitlab-com/chatops/-/jobs/938561

Create missing records for GitLab instances, with k8s cluster shared on group level

WITH created_records AS (
  INSERT INTO services (project_id, "active","properties","type","template","push_events","issues_events","merge_requests_events","tag_push_events","note_events","category","default","wiki_page_events","pipeline_events","confidential_issues_events","commit_events","job_events","confidential_note_events", created_at, updated_at)
  SELECT "projects"."id", true,'{}','PrometheusService',false,true,true,true,true,true,'monitoring',false,true,true,true,true,true,true, TIMEZONE('UTC', NOW()) as created_at, TIMEZONE('UTC', NOW()) as updated_at FROM "projects" LEFT JOIN services ON services.project_id = projects.id AND services.project_id BETWEEN 1 AND 10000
                    AND services.type = 'PrometheusService' INNER JOIN cluster_groups ON cluster_groups.group_id = projects.namespace_id INNER JOIN clusters_applications_prometheus ON clusters_applications_prometheus.cluster_id = cluster_groups.cluster_id
                      AND clusters_applications_prometheus.status IN (3, 5) WHERE (projects.id BETWEEN 1 AND 10000 AND services.id IS NULL)
  RETURNING *
)
SELECT COUNT(*) as number_of_created_records
FROM created_records

plan: https://explain.depesz.com/s/5pLM Summary:

Time: 4.637 s
  - planning: 4.576 ms
  - execution: 4.632 s
    - I/O read: 4.550 s
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 1075 (~8.40 MiB) from the buffer pool
  - reads: 3987 (~31.10 MiB) from the OS file cache, including disk I/O
  - dirtied: 50 (~400.00 KiB)
  - writes: 0

with index:

plan: https://explain.depesz.com/s/WgOq

Summary:

Time: 4.306 s
  - planning: 5.356 ms
  - execution: 4.301 s
    - I/O read: 4.222 s
    - I/O write: 0.000 s

Shared buffers:
  - hits: 11524 (~90.00 MiB) from the buffer pool
  - reads: 3989 (~31.20 MiB) from the OS file cache, including disk I/O
  - dirtied: 47 (~376.00 KiB)
  - writes: 0

Update Prometheus services records to active for GitLab instances, with k8s cluster shared on group level

WITH updated_records AS (
  UPDATE services SET active = TRUE
  WHERE services.project_id BETWEEN 1 AND 10000 AND services.properties = '{}' AND services.type = 'PrometheusService'
  AND services.active = FALSE AND EXISTS (
  SELECT 1 FROM "projects" INNER JOIN cluster_groups ON cluster_groups.group_id = projects.namespace_id INNER JOIN 
clusters_applications_prometheus ON clusters_applications_prometheus.cluster_id = cluster_groups.cluster_id
                      AND clusters_applications_prometheus.status IN (3, 5) WHERE (projects.id BETWEEN 1 AND 10000)
) 

  RETURNING *
)
SELECT COUNT(*) as number_of_updated_records
FROM updated_records

plan: https://explain.depesz.com/s/eiLvf Summary:

Time: 5.090 s
  - planning: 3.820 ms
  - execution: 5.086 s
    - I/O read: 4.970 s
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 11524 (~90.00 MiB) from the buffer pool
  - reads: 3987 (~31.10 MiB) from the OS file cache, including disk I/O
  - dirtied: 47 (~376.00 KiB)
  - writes: 0

with index:

plan: https://explain.depesz.com/s/lEcG

Summary:

Time: 28.259 ms
  - planning: 5.396 ms
  - execution: 22.863 ms
    - I/O read: 0.000 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 15509 (~121.20 MiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0
  - writes: 0

Check if instance level cluster is available for given GitLab instance

SELECT 1 FROM "clusters" INNER JOIN clusters_applications_prometheus ON clusters_applications_prometheus.cluster_id = clusters.id AND clusters_applications_prometheus.status IN (3, 5) WHERE "clusters"."cluster_type" = 1

plan: https://explain.depesz.com/s/fee8 Summary:

Time: 362.272 ms
  - planning: 0.341 ms
  - execution: 361.931 ms
    - I/O read: 335.734 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 7303 (~57.10 MiB) from the buffer pool
  - reads: 904 (~7.10 MiB) from the OS file cache, including disk I/O
  - dirtied: 20 (~160.00 KiB)
  - writes: 0

with index:

plan: https://explain.depesz.com/s/BJot

Summary

Time: 635.776 ms
  - planning: 0.298 ms
  - execution: 635.478 ms
    - I/O read: 613.851 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 8577 (~67.00 MiB) from the buffer pool
  - reads: 908 (~7.10 MiB) from the OS file cache, including disk I/O
  - dirtied: 18 (~144.00 KiB)
  - writes: 0

Create missing records for GitLab instances, with k8s cluster shared on instance level

WITH created_records AS (
  INSERT INTO services (project_id, "active","properties","type","template","push_events","issues_events","merge_requests_events","tag_push_events","note_events","category","default","wiki_page_events","pipeline_events","confidential_issues_events","commit_events","job_events","confidential_note_events", created_at, updated_at)
  SELECT "projects"."id", true,'{}','PrometheusService',false,true,true,true,true,true,'monitoring',false,true,true,true,true,true,true, TIMEZONE('UTC', NOW()) as created_at, TIMEZONE('UTC', NOW()) as updated_at FROM "projects" LEFT JOIN services ON services.project_id = projects.id AND services.project_id BETWEEN 1 AND 10000
                    AND services.type = 'PrometheusService' WHERE (projects.id BETWEEN 1 AND 10000 AND services.id IS NULL)
  RETURNING *
)
SELECT COUNT(*) as number_of_created_records
FROM created_records

plan: https://explain.depesz.com/s/dSUz Summary:

Time: 5.195 s
  - planning: 1.602 ms
  - execution: 5.193 s
    - I/O read: 369.044 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 102002 (~796.90 MiB) from the buffer pool
  - reads: 607 (~4.70 MiB) from the OS file cache, including disk I/O
  - dirtied: 304 (~2.40 MiB)
  - writes: 0

whit index:

plan: https://explain.depesz.com/s/GPqU

Summary

Time: 4.697 s
  - planning: 0.985 ms
  - execution: 4.696 s
    - I/O read: 378.316 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 117088 (~914.80 MiB) from the buffer pool
  - reads: 1509 (~11.80 MiB) from the OS file cache, including disk I/O
  - dirtied: 289 (~2.30 MiB)
  - writes: 0

Update Prometheus services records to active for GitLab instances, with k8s cluster shared on instance level

WITH updated_records AS (
  UPDATE services SET active = TRUE
  WHERE services.project_id BETWEEN 1 AND 10000 AND services.properties = '{}' AND services.type = 'PrometheusService'
  AND services.active = FALSE AND 1 = 1
  RETURNING *
)
SELECT COUNT(*) as number_of_updated_records
FROM updated_records

plan: https://explain.depesz.com/s/e5wf Summary:

Time: 13.651 ms
  - planning: 0.364 ms
  - execution: 13.287 ms
    - I/O read: 8.689 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 3 (~24.00 KiB) from the buffer pool
  - reads: 295 (~2.30 MiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

with index:

plan: https://explain.depesz.com/s/xUke Summary

Time: 1.304 ms
  - planning: 0.442 ms
  - execution: 0.862 ms
    - I/O read: 0.624 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 0 from the buffer pool
  - reads: 2 (~16.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

gitlab.com

Database State:

gitlab.com database: There is 6_496_601 rows in namespaces table

/chatops run explain select id from namespaces
EXPLAIN output 
Index Only Scan using namespaces_pkey on namespaces  (cost=0.43..124451.56 rows=6491743 width=4) 
(actual time=0.077..1708.589 rows=6496601 loops=1)

There is 3_543_406 rows in services table

/chatops run explain select 1 from services
EXPLAIN output 
Index Only Scan using index_services_on_project_id on services  (cost=0.43..130977.60 rows=3543406 width=4) 
(actual time=0.251..4525.741 rows=3549176 loops=1)

There is 17_435 rows in services table of type PrometheusService

/chatops run explain select 1 from services where type = ‘PrometheusService’
EXPLAIN output 
Index Only Scan using index_services_on_type on services  (cost=0.56..1091.62 rows=16961 width=4) 
(actual time=2.877..156.291 rows=17435 loops=1)

There is 12_341_591 rows in projects table

/chatops run explain select id from projects
EXPLAIN output 
Index Only Scan using projects_pkey on projects  (cost=0.43..300956.89 rows=12342061 width=4) 
(actual time=0.100..6313.379 rows=12341591 loops=1)

From previous approach !19956 (merged)

Affected rows on gitlab.com 5621 (updated: 22 and inserted: 5599)

What does this MR do?

This is another approach to fix state of services relation. Due to a bug, projects which was connected to group or instance level shared Kubernetes clusters, hadn't have created corresponding services entries when Prometheus application was installed on this clusters.

Previous attempts: !19956 (merged) and !22974 (merged) was reverted due to number of performance issues. First of all there is important difference between gitlab.com and staging.gitlab.com. Staging has instance level shared cluster, which means that almost all of projects there are affected. While gitlab.com has only group level shared clusters, which gives roughly around 6000 affected rows. Due to this difference approaches tailored for gitlab.com, flooded sidekiq with excessive number of background jobs https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9160 Another thing to remember is that background job has minimum lease time of 2 minutes, so work should be scheduled in bigger chunks in order to avoid idle waiting for minimum lease to expire (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9064#note_279857215).

This implementation is inspired by !19956 (comment 287626170) however since there is no uniq constraint on services table except for primary key, using ON CONFLICT was not possible. We may still decide to add such constraint, but I tried to find a way to avoid that, as it is not a cheap operation.

Screenshots

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

Reports #14857 (closed)

Edited by 🤖 GitLab Bot 🤖

Merge request reports

Loading