WIP: Use de-duplication to reduce memory and the number of SQL queries for import
What does this MR do?
This uses de-duplication to:
- reduce the amount of memory needed to hold the hash, as there's a ton of duplication,
- re-use already created relations instead of creating new ones.
This is based on: !18005 (merged) !18003 (merged) !18007 (merged) !18024 (merged)
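To make the idea concrete, here is a minimal sketch of structural de-duplication over a parsed JSON tree. It is not the code from this MR: `deduplicate` and `pool` are hypothetical names, and the sketch assumes the tree consists only of plain `Hash`, `Array`, and scalar values.

```ruby
# Hypothetical helper, not the MR's implementation.
# Walks a parsed JSON tree and makes hashes with equal content
# share a single Hash instance.
def deduplicate(node, pool = {})
  case node
  when Hash
    # De-duplicate children first, so that equal subtrees collapse bottom-up.
    copy = node.each_with_object({}) do |(key, value), result|
      result[key] = deduplicate(value, pool)
    end
    # Ruby hashes compare and hash by content, so an equal Hash that was
    # seen earlier is found here and returned instead of the new copy.
    pool[copy] ||= copy
  when Array
    node.map { |item| deduplicate(item, pool) }
  else
    node
  end
end

tree = { "labels" => [{ "title" => "bug" }, { "title" => "bug" }] }
result = deduplicate(tree)
puts result["labels"][0].equal?(result["labels"][1]) # prints true
```

Both memory savings and relation re-use follow from the same property: equal content ends up as one object, so anything keyed on object identity is created once.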
Problems
We need to be careful about where de-duplication is used, as it can introduce hard-to-debug problems.
Let's consider the following example:
"merge_requests": [
  {
    "id": 27,
    "target_branch": "feature",
    "source_branch": "feature_conflict",
    "source_project_id": 999,
    "author_id": 1,
    "merge_params": {
      "force_remove_source_branch": null
    },
    ...
    "resource_label_events": [
      {
        "id": 243,
        "action": "add",
        "issue_id": null,
        "merge_request_id": 27,
        "label_id": null,
        "user_id": 1,
        "created_at": "2018-08-28T08:24:00.494Z"
      }
    ],
There are problems with:
- merge_params, which might point to the same hash,
- resource_label_events (not exactly here, as each entry has a unique id).
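The shared-hash hazard can be reproduced in a few lines. This is an illustration only: `build_merge_requests` is a hypothetical helper, id 28 is made up, and the `"1"` value is just a stand-in for whatever the importer would write.

```ruby
# Hypothetical illustration of the aliasing problem; not code from this MR.
def build_merge_requests(deduplicated:)
  params_a = { "force_remove_source_branch" => nil }
  # With de-duplication, equal content collapses into one shared object.
  # Without it, every merge request keeps its own copy.
  params_b = deduplicated ? params_a : params_a.dup
  [
    { "id" => 27, "merge_params" => params_a },
    { "id" => 28, "merge_params" => params_b }
  ]
end

mr_a, mr_b = build_merge_requests(deduplicated: true)
mr_a["merge_params"]["force_remove_source_branch"] = "1"
# The second merge request changed as well, although nothing touched it.
puts mr_b["merge_params"]["force_remove_source_branch"] # prints 1
```

Nothing fails at the point of the write, and the wrong value only surfaces later on an unrelated record, which is why this kind of bug is hard to trace.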
The merge_params case needs to be handled automatically,
so de-duplication needs to understand whether the hierarchy it defines
is linked at the top level.
Ideally, this means that we should de-duplicate only top-level objects, and that objects on lower levels can be re-used only if a matching entry is found at the top level.
It means that we should consider de-duplicating only relations such as:
- labels => label,
- milestones => milestone,
- likely others as well.
This reduces the efficiency, but should also reduce the chance of things going sideways.
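A rough sketch of this restricted approach follows. It assumes the tree has top-level `labels` and `milestones` arrays and that nested objects refer to them through `label` and `milestone` keys with identical content. `DEDUP_RELATIONS`, `build_pools`, and `link_to_top_level` are hypothetical names, not part of this MR.

```ruby
# Hypothetical mapping from a top-level collection to the nested key
# that refers to one of its entries (taken from the list above).
DEDUP_RELATIONS = { "labels" => "label", "milestones" => "milestone" }.freeze

# Index the top-level entries by content, one pool per nested key.
def build_pools(tree)
  DEDUP_RELATIONS.each_with_object({}) do |(collection, nested_key), pools|
    pools[nested_key] = Array(tree[collection]).to_h { |entry| [entry, entry] }
  end
end

# Replace nested "label" / "milestone" hashes with the matching top-level
# object. Everything else, including merge_params, is left untouched.
def link_to_top_level(node, pools)
  case node
  when Array
    node.each { |item| link_to_top_level(item, pools) }
  when Hash
    node.each do |key, value|
      pool = pools[key]
      if pool && value.is_a?(Hash)
        # Fall back to the original value when no top-level match exists.
        node[key] = pool.fetch(value, value)
      else
        link_to_top_level(value, pools)
      end
    end
  end
  node
end

tree = {
  "labels" => [{ "title" => "bug" }],
  "issues" => [{ "label_links" => [{ "label" => { "title" => "bug" } }] }]
}
link_to_top_level(tree, build_pools(tree))
nested = tree["issues"][0]["label_links"][0]["label"]
puts nested.equal?(tree["labels"][0]) # prints true
```

Because only allowlisted keys are ever replaced, two merge requests with equal `merge_params` keep separate hashes, at the cost of missing some duplicates.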