Make it possible to perform complex and long-running migrations
Description
GitLab is growing fast, and the amount of data in our database is growing fast with it. We need a way to perform complex and long-running migrations in order to keep our development velocity up and our codebase in good shape.
The problem is that we still do not have a good way to perform complex and long-running migrations. We recently improved our development techniques for writing migrations and avoiding downtime, but even though we can now avoid downtime, we are still constrained by the time required to complete complex migrations.
I believe this is a problem we can no longer ignore, because we need to solve it for https://gitlab.com/gitlab-org/gitlab-ce/issues/26481, which is slated for %9.3. We foresee that this migration can take a lot of time, much more than our current migration methods can support. Nonetheless, this is not only a ~CI problem; it is a GitLab-wide problem: how to perform data migrations that, despite being super-optimized, can run for days (provided that we use our current techniques and methods to write and execute migrations).
Currently, we have two types of migrations: regular migrations and post-deployment migrations. Both kinds use `ActiveRecord::Migration` under the hood and are very similar. Neither can be used to run migrations that take half a day or more.
Recently, @ayufan, @yorickpeterse, and I were thinking about what can be done to resolve this problem. I know that @smcgivern was also interested in some kind of solution.
Let's use this issue to gather information about our options and to discuss which direction is the most promising one.
Database sharding

This is a widely used technique in other companies that have hit a large scale. Having multiple database servers, with different data on each, makes it possible to run migrations on each shard separately, reducing the total migration time significantly. This is not an easy nor a boring solution, but it is still something that might be worth exploring. High-level, application-level database sharding is probably as difficult as low-level, database-level sharding; an issue to discuss the former is https://gitlab.com/gitlab-org/gitlab-ee/issues/1043.
Pros:
- might solve a lot of other problems, especially performance
Cons:
- difficult to implement
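To make the idea concrete, here is a minimal, purely illustrative sketch of application-level sharding. Everything in it (`SHARD_COUNT`, `shard_for`, `run_migration_on_all_shards`) is a hypothetical name invented for this example, not an existing GitLab API; the point is only that a deterministic ID-to-shard mapping lets a migration run on each shard independently.

```ruby
# Hypothetical application-level sharding sketch: route each project's
# data to one of N database shards by its ID. All names here are
# illustrative assumptions, not real GitLab code.
SHARD_COUNT = 4

# Deterministically map a project ID to a shard index.
def shard_for(project_id)
  project_id % SHARD_COUNT
end

# A migration could then run against each shard independently, so the
# wall-clock time is roughly total_time / SHARD_COUNT when run in parallel.
def run_migration_on_all_shards
  SHARD_COUNT.times { |shard| yield(shard) }
end
```

With a mapping like this, each shard's migration only touches `1/SHARD_COUNT` of the data, which is where the time saving would come from.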
Resource-specific migrations
This is a new concept, authored by @ayufan. It assumes that we don't need to migrate everything at once, and that we won't be able to do so in the future. The idea is to trigger migrations for specific resources on demand. In other words, trigger a migration for 20 pipelines if a user wants to access the first page of the pipelines table (using pagination). If someone clicks the second page, trigger a migration for the next 20 resources, and so on.
This is a very interesting idea, but it is also not perfect. We managed to find some time to discuss the technical details of it.
Pros:
- It might save a lot of resources and time spent on performing migrations, because it migrates only those resources that are in use, on demand.
Cons:
- This kind of migration is not tied to a specific database schema, which means we need to support these migrations across versions. Someone might need the first page of the pipelines in 9.4, but might still need the 50th page in 9.6. The implication is that we can't rely on a particular database schema when writing these migrations; they need to use regular, well-tested code instead.
- Because we need to make sure that each object-specific migration works fine with 9.4, 9.5, and so on for as long as we support it, we need super-robust specs for it. That part is good: we can make each migration a regular class and use TDD techniques to develop it, which makes this code much easier to write and maintain. But maintaining these migrations across many versions can be tedious.
- The problems above are solvable; the biggest concern is that this introduces non-determinism and inconsistency at the database level. In GitLab today we use regular queries, `pluck`, and similar methods to access data. We won't be able to rely on them if we start to depend on resource-specific migrations: `Ci::Stage.all` won't be a single source of truth for how many stages we have in the database, because we haven't migrated all of them yet. It might be possible to hook resource-specific migrations into `ActiveRecord::Relation` objects, and migrate the relevant resources when someone does `project.pipeline.stages`, but there are a lot of quirky technical aspects behind this, and the complexity is significant.
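The pagination-triggered idea above can be sketched in plain Ruby. This is an assumption-laden toy, not GitLab code: `Pipeline`, `PipelinePage`, and `migrate!` are hypothetical stand-ins showing only the shape of "migrate the 20 records a page needs, just before serving it".

```ruby
# Toy model of a resource-specific, on-demand migration (illustrative only).
class Pipeline
  attr_reader :id
  attr_accessor :migrated

  def initialize(id)
    @id = id
    @migrated = false # pretend this record still has the old data shape
  end
end

class PipelinePage
  PER_PAGE = 20

  def initialize(pipelines)
    @pipelines = pipelines
  end

  # Fetch one page of pipelines, migrating its records first if needed.
  def page(number)
    slice = @pipelines.slice((number - 1) * PER_PAGE, PER_PAGE) || []
    slice.each { |p| migrate!(p) unless p.migrated }
    slice
  end

  private

  # Stand-in for the real per-record migration logic.
  def migrate!(pipeline)
    pipeline.migrated = true
  end
end
```

Note how records on unvisited pages stay unmigrated indefinitely, which is exactly the consistency concern raised in the last bullet above.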
Background migrations
Background migrations are online-only migrations. This approach does not reduce the total amount of time needed to migrate data, like the two techniques above do; instead, it makes it possible to complete long-running migrations at all. The idea is to use Sidekiq and asynchronous background processing to run migrations in the background, even if they need days to complete.
Pros:
- The most boring solution, with the smallest technical challenge to implement.
Cons:
- It does not reduce the total amount of time needed to complete migrations.
- There are still some technical challenges, like how to ensure that a migration completes when the process is interrupted (machine restart, GitLab down, etc.).
- It is necessary to provide additional mechanisms that will ensure that everything works as expected, even if the migration is still in progress.
- It requires more development effort to use this technique, because additional code-level mechanisms are required to compensate for data that has only been partially migrated.
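A minimal sketch of how such a background migration could be structured, assuming nothing beyond the idea in this section: split the table into ID ranges, enqueue one job per range, and retry a range if its job is interrupted. A real implementation would use Sidekiq; here a plain in-memory queue stands in for it, and all class and method names are invented for illustration.

```ruby
# In-memory stand-in for a Sidekiq queue (illustrative only).
class BackgroundMigrationQueue
  def initialize
    @jobs = []
  end

  # Enqueue a migration job for an inclusive ID range.
  def enqueue(migration_name, range)
    @jobs << [migration_name, range]
  end

  # Drain the queue, re-enqueueing a job if it raises, so an
  # interrupted range is retried rather than silently lost.
  def drain
    until @jobs.empty?
      name, range = @jobs.shift
      begin
        yield(name, range)
      rescue StandardError
        @jobs << [name, range]
      end
    end
  end
end

# Split IDs 1..total into batches of batch_size and enqueue each one.
def schedule_in_batches(queue, migration_name, total, batch_size)
  (1..total).step(batch_size) do |start|
    queue.enqueue(migration_name, start..[start + batch_size - 1, total].min)
  end
end
```

Batching by ID range is what keeps each individual job short, so the multi-day total runs as many small, individually retryable units.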
/cc @stanhu @dzaporozhets @ernstvn @ayufan @yorickpeterse @pcarranza @smcgivern