WIP: Introduce simple ActiveRecord-based bulk-insert functionality
NOTE: I am closing this in favor of two smaller MRs:
What does this MR do?
Adds support for bulk-inserting associations safely.
References: #196844 (closed)
New bulk insertion API
Bulk insertions are crucial for storing large amounts of data efficiently. However, we also identified the need for this to happen in a safe manner, i.e. by ensuring bulk insertions are only available when we have certain guarantees that we are not causing integrity problems or violating business rules (often encoded in ActiveRecord validations).
This MR builds on !24168 (merged) in the following ways:
BulkInsertSafe.[bulk_insert|bulk_insert!]
These two new methods operate on sequences of ActiveRecord objects. They behave similarly to save and save! in the sense that they run validations and either return a boolean indicating success or raise an exception. This ensures that we won't write data that would fail validation if it were instead inserted via save or similar built-ins.
Internally these calls rely on ActiveRecord 6's new InsertAll type, which inserts hashes in bulk but does not run validations. Relying on InsertAll and running validations first are the primary differences from the existing Database.bulk_insert helper.
Note that as of !24168 (merged) you can only access this functionality if (as the name suggests) your target model type is considered "safe for bulk insertion"; these rules are currently fairly simple and prevent certain callbacks from being registered, but can be easily expanded on in the future.
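To illustrate what "prevent certain callbacks from being registered" could look like, here is a minimal plain-Ruby sketch. It is not the code from !24168: the module name, the blocklist contents, and the error class are all hypothetical, and intercepting set_callback is an assumption about how such a check might be wired up.

```ruby
# Hypothetical sketch; not the actual BulkInsertSafe implementation.
module BulkInsertSafeSketch
  # Assumed blocklist: callback types a bulk insert would silently skip.
  BLOCKED_CALLBACKS = %i[save create commit].freeze

  MethodNotAllowedError = Class.new(StandardError)

  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Same argument shape as ActiveSupport's set_callback(name, *filter_list).
    # A real module would call super here instead of keeping its own list.
    def set_callback(name, *args)
      if BLOCKED_CALLBACKS.include?(name)
        raise MethodNotAllowedError,
              "#{self}: :#{name} callbacks are not safe for bulk insertion"
      end

      registered_callbacks << [name, *args]
    end

    def registered_callbacks
      @registered_callbacks ||= []
    end
  end
end

class FakeLabelLink
  include BulkInsertSafeSketch
end

# Validation callbacks are fine, since bulk_insert still runs validations.
FakeLabelLink.set_callback(:validate, :before, :check_label)

begin
  # A save callback would never fire during a bulk insert, so it is rejected.
  FakeLabelLink.set_callback(:save, :after, :notify)
rescue BulkInsertSafeSketch::MethodNotAllowedError => e
  puts e.message
end
```

Failing at class-definition time rather than at insert time means an unsafe model is caught as soon as it is loaded, not when the first bulk insert happens.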
Code example:
class LabelLink < ApplicationRecord
include BulkInsertSafe
end
label_links = ... # build some label links
LabelLink.bulk_insert(label_links, batch_size: 100)
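Roughly, a call like the one above could validate first and then hand row hashes to insert_all in batches. The following is a plain-Ruby sketch under stated assumptions, not the implementation in this MR: FakeRecord and FakeModel are stand-ins so it runs without a database, the default batch size is made up, and validating everything up front is a simplification (a real version would more likely rely on a transaction).

```ruby
# Hypothetical sketch of validate-then-insert. Only bulk_insert,
# bulk_insert! and insert_all are names taken from this MR / Rails.
module BulkInsertSketch
  RecordInvalid = Class.new(StandardError)

  # Like save!: raises if any record fails validation.
  def bulk_insert!(records, batch_size: 500)
    # Validate everything before writing, so a failure writes nothing.
    invalid = records.reject(&:valid?)
    raise RecordInvalid, "#{invalid.size} invalid record(s)" if invalid.any?

    records.each_slice(batch_size) do |batch|
      insert_all(batch.map(&:attributes))
    end
    true
  end

  # Like save: returns false instead of raising.
  def bulk_insert(records, batch_size: 500)
    bulk_insert!(records, batch_size: batch_size)
  rescue RecordInvalid
    false
  end
end

# Stand-in for an ActiveRecord instance.
FakeRecord = Struct.new(:label_id, :target_id) do
  def valid?
    !label_id.nil?
  end

  def attributes
    to_h
  end
end

# Stand-in for a model class; remembers each batch instead of writing it.
class FakeModel
  extend BulkInsertSketch

  def self.batches
    @batches ||= []
  end

  def self.insert_all(rows)
    batches << rows
  end
end

records = (1..5).map { |i| FakeRecord.new(i, 100 + i) }
FakeModel.bulk_insert(records, batch_size: 2)   # three batches: 2, 2, 1 rows
FakeModel.bulk_insert([FakeRecord.new(nil, 1)]) # invalid record, nothing written
```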
BulkInsertableAssociations: insert has_many associations in bulk
Given a type that is BulkInsertSafe, if it appears on the "owned" end of a relationship such as has_many, we can now bulk-insert these records via the owner. This is currently done using a combination of two method calls where we first schedule a set of records for bulk insertion, then flush them whenever the parent is saved:
class MergeRequestDiff
include BulkInsertableAssociations
has_many :merge_request_diff_commits
end
parent = MergeRequestDiff.new
diff_commits = ...
parent.try_bulk_insert_on_save(:merge_request_diff_commits, diff_commits)
...
parent.save # this will insert all pending `diff_commits` in bulk
Internally this is realized using an after_save hook. This way we can exploit transactionality of AR's callback chains. The try_bulk_insert_on_save helper actually lives on ApplicationRecord to make these inserts safer and a little less awkward, since we cannot say upfront whether a) the parent defines that method and b) the association we target is BulkInsertSafe.
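The schedule-then-flush flow can be sketched in plain Ruby as follows. This is an illustration only: FakeApplicationRecord, WRITES, bulk_insert_on_save and flush_pending_bulk_inserts are hypothetical names, save merely simulates where an after_save hook would run, and the sketch checks only condition a) from above (whether the parent supports scheduling), not whether the association target is BulkInsertSafe.

```ruby
# Hypothetical sketch; the MR hooks into ActiveRecord's after_save instead.
WRITES = [] # every simulated bulk write, as [association, rows]

module BulkInsertableAssociationsSketch
  # Queue rows for an association; nothing is written yet.
  def bulk_insert_on_save(association, records)
    pending_bulk_inserts[association].concat(records)
  end

  def pending_bulk_inserts
    @pending_bulk_inserts ||= Hash.new { |hash, key| hash[key] = [] }
  end

  # Stand-in for the after_save hook: write each queue once, then empty it.
  def flush_pending_bulk_inserts
    pending_bulk_inserts.each do |association, records|
      WRITES << [association, records]
    end
    pending_bulk_inserts.clear
  end
end

class FakeApplicationRecord
  # Lives on the base class so callers need not check for support first.
  def try_bulk_insert_on_save(association, records)
    return false unless respond_to?(:bulk_insert_on_save)

    bulk_insert_on_save(association, records)
    true
  end

  def save
    # Persisting the parent itself is omitted from this sketch.
    flush_pending_bulk_inserts if respond_to?(:flush_pending_bulk_inserts)
    true
  end
end

class FakeDiff < FakeApplicationRecord
  include BulkInsertableAssociationsSketch
end

class FakeNote < FakeApplicationRecord; end

parent = FakeDiff.new
parent.try_bulk_insert_on_save(:merge_request_diff_commits, [{ sha: "abc" }, { sha: "def" }])
parent.save # flushes the two queued rows in one simulated bulk write

# A parent without the module simply reports that nothing was scheduled.
FakeNote.new.try_bulk_insert_on_save(:anything, [{}])
```

Clearing the queue after the flush keeps a second save on the same parent from inserting the same rows again.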
Migration path
Since this new API extends existing bulk-insert functionality in several ways, we should establish:
- whether it can fully replace Database.bulk_insert
- or whether it should live alongside it (considering it operates on AR instances, not row hashes)
- or whether we should first migrate to insert_all everywhere
TODOs:
- ensure thread-safety
- handle validations on pending inserts
- what happens to new items that are yet unsaved?
- implement batching
- documentation & better error messages with links to docs
- implement in importer, measure results
- insert_all vs upsert_all
- consider using insert_all! to catch duplicate key errors
- bulk_insert wrapper function
- feature toggle
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- [-] Separation of EE specific content
Availability and Testing
- we plan to roll this out behind a feature flag first, where we can enable it in project imports
- it would be interesting to test this with another existing feature, but I would need some pointers as to what that could be