Draft: PoC: Pods architecture using decomposed cluster-wide tables approach
TL;DR
This is a proof of concept that tries to model some aspects of the Pods Stateless Router Proposal.
What does it do?
Iteration 1 (concluded with recording on 12th of October)
These are the initial changes implemented in the PoC and presented in https://youtu.be/mUcALjn-yqQ.
- Application behavior: Focuses on modeling how GitLab would work in something similar to the described architecture.
- Focus on unknowns: Chooses to ignore all known problems, since we know how they could be solved, albeit with a significant amount of effort.
- Classify tables: There is currently manual WIP to classify what could be part of `gitlab_users`. This PoC implements a more systematic approach: it builds a complete reference map between all tables (~500) and adds context (pod or cluster) to identify references (related or external). Based on that, it rebuilds and provides a list of tables that are cluster-wide or pod-local.
- Foreign keys: Gets rid of foreign keys between cluster-wide and pod-local tables.
- Use PostgreSQL schema: Based on the table classification, creates PostgreSQL schemas (`context`, `pod_N`) to limit data visibility between "virtual Pods". This is needed to model how the application would work if it had access only to the data in `pod_0` and `context`.
- Implement many Pods: Creates two Pods with a set of cluster-wide tables, with the help of PostgreSQL schemas.
- Switch Pods: Extends the performance bar to dynamically switch between Pods, to quickly test different features.
- Passthrough Pod into Sidekiq: Any operation triggering a Sidekiq Worker passes the Pod specification through, so that workers are executed against the selected Pod.
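The table classification step above can be illustrated with a small sketch. This is not the PoC's classification script: the reference map, the seed tables, and the rule "anything a cluster-wide table references must itself be cluster-wide" are all assumptions made for illustration. It shows the general idea of deriving a cluster/pod context per table and then listing the references that cross contexts, which are the foreign keys that would have to go.

```ruby
require 'set'

# Hypothetical reference map: table => tables it points at through foreign keys.
# The real map covers ~500 tables; these entries are illustrative only.
REFERENCES = {
  'users'         => [],
  'namespaces'    => ['users'],
  'projects'      => ['namespaces', 'users'],
  'issues'        => ['projects', 'users'],
  'user_statuses' => ['users'],
  'emails'        => ['users']
}.freeze

# Assumed rule: start from seed tables known to be cluster-wide and pull in
# everything they reference, transitively. All remaining tables are pod-local.
def classify(references, seeds)
  cluster = Set.new
  queue = seeds.dup
  until queue.empty?
    table = queue.shift
    next unless cluster.add?(table) # add? returns nil if already visited

    queue.concat(references.fetch(table, []))
  end
  references.keys.to_h { |table| [table, cluster.include?(table) ? :cluster : :pod] }
end

# References whose two ends live in different contexts ("external" references).
def external_references(references, contexts)
  references.flat_map do |table, targets|
    targets.reject { |target| contexts[target] == contexts[table] }
           .map { |target| [table, target] }
  end
end

contexts = classify(REFERENCES, ['user_statuses', 'emails'])
external = external_references(REFERENCES, contexts)
```

With these sample inputs, `users` ends up cluster-wide because two seed tables reference it, and the three pod-local tables that point at `users` show up in `external`.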
This follows the write-sync approach, where all Pods can write to cluster-wide tables.
- A `pod_N` in general uses a DB replica, but can write to `cluster-wide` tables directly by connecting to the primary database => a region can probably work fine with this approach
- The set of cluster-wide tables is under the `public` schema
- Pod-specific tables are under the `pod_0` and `pod_1` schemas
- A Rack/Sidekiq Middleware is added to configure `connection.schema_search_path = "public,pod_0|pod_1"` depending on the `selected_pod` Cookie, to model switching organizations
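The cookie-driven schema switch described above can be sketched as a Rack-style middleware. This is a minimal illustration rather than the PoC's middleware: the class name `PodSelectorSketch`, the `FakeConnection` stand-in, and the fallback to `pod_0` for unknown values are assumptions. Only the `selected_pod` cookie name and the `public,pod_N` search path come from the description.

```ruby
# Stand-in for a database connection; only the attribute that matters here
# is modelled (hypothetical, not ActiveRecord).
FakeConnection = Struct.new(:schema_search_path)

class PodSelectorSketch
  PODS = %w[pod_0 pod_1].freeze
  DEFAULT_POD = 'pod_0' # assumption: fall back to pod_0

  def initialize(app, connection)
    @app = app
    @connection = connection
  end

  # Rack-style entry point: choose the Pod, then hand over to the app.
  def call(env)
    pod = selected_pod(env['HTTP_COOKIE'].to_s)
    @connection.schema_search_path = "public,#{pod}"
    @app.call(env)
  end

  private

  # Parse the Cookie header by hand to keep the sketch dependency-free.
  # Only allow-listed values are accepted, so a cookie cannot inject
  # arbitrary schema names into the search path.
  def selected_pod(cookie_header)
    cookies = cookie_header.split(/;\s*/).to_h do |pair|
      name, value = pair.split('=', 2)
      [name, value]
    end
    PODS.include?(cookies['selected_pod']) ? cookies['selected_pod'] : DEFAULT_POD
  end
end

app = ->(_env) { [200, {}, ['ok']] }
connection = FakeConnection.new
middleware = PodSelectorSketch.new(app, connection)
middleware.call({ 'HTTP_COOKIE' => 'theme=dark; selected_pod=pod_1' })
connection.schema_search_path # => "public,pod_1"
```

A Sidekiq middleware could apply the same idea by reading the Pod from the job payload instead of a cookie.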
Iteration 2 (in progress)
These are additional changes implemented in this PoC to continue solution validation. They are yet to be presented.
- Router: Implement a router to send pre-flight requests, and `path_info` classification on the Rails side: !102553 (merged).
- GitLab-Shell and Gitaly: Fix support for Git Push and make it work with the Router.
- PreventClusterWrites: Implement a mechanism to model the `async writes` approach, where only `Pod 0` can write to cluster-wide tables.
- QA: Work on fixing as many QA tests as possible.
This follows the write-async approach, where only Pod 0 can write to cluster-wide tables.
- A `pod_N` always uses a DB replica of `cluster-wide` tables and is expected to observe latency of up to 500ms on those tables => a region-first approach
- The set of cluster-wide tables is under the `public` schema
- Pod-specific tables are under the `pod_0` and `pod_1` schemas
- A Rack/Sidekiq Middleware is added to configure `connection.schema_search_path = "public,pod_0|pod_1"` depending on the `selected_pod` Cookie, to model switching organizations
- Only `pod_0` can write to `cluster-wide` tables: this is enforced by the `PreventClusterWrites` Query Analyzer
- The `pod_N` forwards write calls via API to `pod_0`: this is currently modelled by suppressing `PreventClusterWrites` in the place where the write happens, to identify the places that need to be changed
- Some endpoints are forced to be served by `pod_0`, like `/admin` or `/-/profile`, because they require cluster-wide access.
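The write restriction above can be illustrated with a toy analyzer. This is only a sketch of the idea, not the `PreventClusterWrites` implementation: the class name, the hard-coded table list, and the regex-based detection of the written table are assumptions (a real query analyzer would more likely work on a parsed query). It raises when a Pod other than `pod_0` writes to a cluster-wide table, and offers a `suppress` block to mark places that would later forward the write to `pod_0`.

```ruby
class ClusterWriteError < StandardError; end

class PreventClusterWritesSketch
  # Hypothetical classification; the PoC derives this from its table map.
  CLUSTER_WIDE_TABLES = %w[users namespaces].freeze

  # Simplification: capture the target table of INSERT/UPDATE/DELETE,
  # allowing an optional schema prefix and optional double quotes.
  WRITE_PATTERN =
    /\A\s*(?:INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+(?:"?\w+"?\.)?"?(\w+)"?/i

  def initialize(current_pod)
    @current_pod = current_pod
    @suppressed = false
  end

  # Raises if this Pod is not allowed to run the given statement.
  def analyze!(sql)
    match = WRITE_PATTERN.match(sql)
    return if match.nil? # not a write

    table = match[1].downcase
    return unless CLUSTER_WIDE_TABLES.include?(table)
    return if @current_pod == 'pod_0' || @suppressed

    raise ClusterWriteError, "#{@current_pod} must not write to cluster-wide table #{table}"
  end

  # Temporarily allow cluster-wide writes, marking a call site that would
  # eventually forward the write to pod_0 via API.
  def suppress
    previous = @suppressed
    @suppressed = true
    yield
  ensure
    @suppressed = previous
  end
end

analyzer = PreventClusterWritesSketch.new('pod_1')
analyzer.analyze!("UPDATE issues SET title = 'x' WHERE id = 1") # pod-local: allowed
analyzer.suppress do
  analyzer.analyze!("UPDATE users SET name = 'x' WHERE id = 1") # allowed while suppressed
end
```

Outside a `suppress` block, the same `UPDATE users` statement raises `ClusterWriteError` on `pod_1`, which is how the places that still need changing are surfaced.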
What problems does it ignore, since we know that we can solve them?
- Decompose cluster-wide tables: We know that we can decompose all `cluster-wide` tables (as we did for CI decomposition). The biggest problem there is fixing all cross-join schemas. Using a single logical database with separate PostgreSQL schemas (`cluster+pod_0` or `cluster+pod_1`) keeps all existing cross-joins working, while still creating separate visibility between tables.
- Monotonic sequences: We know that we can handle ID sequences across all Pods in a non-conflicting way for things like `projects.id` or `issues.id`. This PoC makes all PostgreSQL sequences shared across all `pod_0/pod_1`.
- Loose foreign keys: The loose foreign keys need to be updated to allow removal across different Pods.
- Partitioning: The partitioning code uses `gitlab_partitions_dynamic` and `gitlab_partitions_static`. Since this is not compatible with the `context, pod_N` approach, all partitioned tables are for now converted into non-partitioned ones.
- Sidekiq Cron: Only regular Sidekiq Workers are covered. In the future each Pod would have its own Sidekiq Cron executor.
Problems to evaluate
- Router: The current approach uses a single GitLab, passes a Cookie, and dynamically sets `schema_search_path` depending on the selected Pod. Ideally the router (Workhorse?) should understand, or have the logic, to route a request to the correct Pod based on information from GitLab Rails.
- Cross-Pod talking:
  a. fetch data from another Pod (like Project)
  b. aggregate data across all Pods
  c. schedule a Sidekiq job in the context of another Pod
  d. route all requests (Controller, GraphQL and API) to the correct Pod automatically
- Many versions of GitLab: A true Pod architecture allows running many different versions of GitLab at the same time, which makes it possible to upgrade some customers less frequently than others, thus improving resiliency against application bugs. In a model of decomposed shared cluster-wide tables this might not be possible, since we would require all nodes to run the same latest version of the application if cluster-wide tables were updated.
- ...
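As a rough illustration of the Router item, the `path_info` classification could look like the sketch below. The function name `pod_for` and the prefix-matching rule are assumptions. Only the two example paths forced to `pod_0` (`/admin` and `/-/profile`) come from the Iteration 2 notes, and the real router in !102553 may classify requests differently.

```ruby
# Paths that need cluster-wide access and are therefore pinned to pod_0.
# Only these two examples are named in the description.
POD_0_PREFIXES = ['/admin', '/-/profile'].freeze

# Returns the Pod that should serve the request: pod_0 for pinned paths,
# otherwise whichever Pod the caller has selected.
def pod_for(path_info, selected_pod)
  pinned = POD_0_PREFIXES.any? do |prefix|
    # Match the prefix itself or anything below it, but not look-alikes
    # such as "/administrator".
    path_info == prefix || path_info.start_with?("#{prefix}/")
  end
  pinned ? 'pod_0' : selected_pod
end

pod_for('/admin/users', 'pod_1')       # => "pod_0"
pod_for('/gitlab-org/gitlab', 'pod_1') # => "pod_1"
```

A router in front of Rails could call such a classification in a pre-flight request and then forward the original request to the Pod it returns.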
Run it
- Configure `config/database.yml` with `schema_search_path:`, ideally using a new DB
- Run `scripts/decomposition/create-pods-database`
- Run `bin/rake -t db:seed_fu` to seed development database
- (Optionally) Run `scripts/decomposition/classify-pods-database` to fetch test DB and update `gitlab_referenced.yml`
```yaml
# config/database.yml
development:
  main:
    database: gitlabhq_development_pods
    schema_search_path: public,pod_0
  ci:
    database: gitlabhq_development_pods
    database_tasks: false
    schema_search_path: public,pod_0
```
Edited by Kamil Trzciński (Back 2025-01-01)