Break repository excess calculation into smaller, simpler queries

Copying from @msmiley's great write-up in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12769#note_525534422:

Make this query more resilient

corrective action - If we run the base_and_descendants CTE as a separate query and plug in the resulting list of namespace ids as an IN list in the main query's WHERE clause, then I think the odds of getting a horrible execution plan will be greatly reduced.

In this incident (as in several others), a mis-estimation in modeling query execution cost lead to choosing a very sub-optimal plan. In this case, the planner thought that there would be a huge number of intermediate result rows for part of the query, and if that had been true, the plan it chose would have been more reasonable.

We can help the planner be less susceptible to mis-estimation. Refactoring the query into 2 separate SQL statements (see below) would in this case help in several ways:

Reducing the join tree depth makes it less likely for a small mis-estimation to be amplified.
Using an IN list for the namespace ids lets the query planner use the column-level statistics for the "projects"."namespace_id" field to produce a more stable and accurate row count estimate.
Query planning itself is faster when there are fewer potential plan permutations for the optimizer to evaluate.
- Because we do not cache our execution plans, the planning time is a significant contributor to the overall db response time under normal circumstances.
- So splitting up this query has a real chance of actually improving db efficiency (certainly in terms of the db's CPU time, and possibly in terms of the rails client's db duration metric too).

I will add more details below, but wanted to go ahead and post this summary quickly.

Example

That query refactoring would look roughly like this in SQL:

/* Part 1: Get list of namespace ids. */

WITH RECURSIVE "base_and_descendants" AS (
        (SELECT "namespaces".*
         FROM "namespaces"
         WHERE "namespaces"."type" = 'Group'
           AND "namespaces"."id" = 889412)
      UNION
        (SELECT "namespaces".*
         FROM "namespaces",
              "base_and_descendants"
         WHERE "namespaces"."type" = 'Group'
           AND "namespaces"."parent_id" = "base_and_descendants"."id")
     )
SELECT id
FROM "base_and_descendants" AS "namespaces"
;

/* Part 2: Filter by the namespace ids using an IN-list rather than a join to a CTE. */

SELECT SUM((("project_statistics"."repository_size" + "project_statistics"."lfs_objects_size") - 10737418240))
FROM "projects"
INNER JOIN "project_statistics" ON "project_statistics"."project_id" = "projects"."id"
WHERE "projects"."namespace_id" IN (
    /* Using an IN-list instead of a CTE avoids cardinality mis-estimatation and allows better use of column-level stats. */
     889412,
    3702953,
    5882754,
    5882903,
    9907712
  )
  AND ("project_statistics"."repository_size" + "project_statistics"."lfs_objects_size") > 10737418240
  AND "projects"."repository_size_limit" IS NULL
;

The code for this calculation appears to be in https://gitlab.com/gitlab-org/gitlab/blob/31d878a69d1e14660612d1729eece36827034461/ee/app/models/ee/namespace.rb#L434-455.

Edited Mar 09, 2021 by Stan Hu