Expose unrecovered import failures in /import REST endpoint
What does this MR do?
Refs #210522 (closed)
In the first MR we started by tracking and exposing correlation IDs in `ProjectImportState`. This MR completes the story by adding the ability to fetch "hard failures" (those relations that failed to import due to some unrecoverable error, or after retries were exhausted), correlate them to import state, and expose them via the `/import` REST endpoints.
Note the new `failed_relations` element:
```json
{
  "id": 39,
  "description": "Ut soluta dignissimos sapiente atque nam eius perferendis.",
  "name": "test-import",
  "name_with_namespace": "Administrator / test-import",
  "path": "test-import",
  "path_with_namespace": "root/test-import",
  "created_at": "2020-04-03T12:48:12.314Z",
  "import_status": "finished",
  "correlation_id": "305413002760596bde01209d46a83cec",
  "failed_relations": [
    {
      "id": 324,
      "created_at": "2020-04-02T14:48:59.526Z",
      "exception_class": "RuntimeError",
      "exception_message": "blah",
      "source": "process_relation_item!",
      "relation_name": "ci_cd_settings"
    }
  ]
}
```
I specifically did not map `ImportFailure` to a REST resource 1:1, since it is a fairly technical/internal concept, and instead defined a resource called `ProjectImportFailedRelation`, which is a narrower concept than just any failure (for instance, a failure can be "soft" in the sense that it was retried but ultimately succeeded; we are not interested in exposing these via the API, as they have no bearing on the outcome of the import). A failed relation as a REST resource is therefore always a "hard failure", i.e. a relation that failed to be imported even though the import as a whole finished successfully.
For reasons of MVC I also decided to simply cap the nested array at 100 items. Until there is a product requirement to actually surface these inside the product itself, paginated or even fetched separately, I don't think we need endpoints to fetch them in full. For now we will instead report these via a pipeline into Slack along the lines of "hey look, we have an import that finished but had 100+ individual failures, maybe you should go look at this."
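To make this concrete, here is a minimal sketch of how the resource could be exposed with a Grape entity. The class names, the nesting under the import status payload, and the `relation_hard_failures` helper are assumptions for illustration, not necessarily the actual implementation:

```ruby
# Sketch only: attribute names mirror the JSON payload above.
module API
  module Entities
    class ProjectImportFailedRelation < Grape::Entity
      expose :id, :created_at, :exception_class, :exception_message,
             :source, :relation_name
    end

    class ProjectImportStatus < Grape::Entity
      expose :import_status, :correlation_id

      # Only "hard" failures (retry_count = 0) are exposed, capped at 100 (MVC).
      expose :failed_relations, using: Entities::ProjectImportFailedRelation do |project, _options|
        project.import_state.relation_hard_failures(limit: 100) # hypothetical helper
      end
    end
  end
end
```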
Database
After some experimentation we ended up deciding to add a partial index on `import_failures` that covers `retry_count = 0`. This will speed up selection of hard failures.
db/migrate/20200409085956_add_partial_index_on_import_failures_retry_count.rb
db/structure.sql
```shell
== 20200409085956 AddPartialIndexOnImportFailuresRetryCount: migrating ========
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:import_failures, [:project_id, :correlation_id_value], {:where=>"retry_count = 0", :algorithm=>:concurrently})
   -> 0.0035s
-- execute("SET statement_timeout TO 0")
   -> 0.0003s
-- add_index(:import_failures, [:project_id, :correlation_id_value], {:where=>"retry_count = 0", :algorithm=>:concurrently})
   -> 0.0065s
-- execute("RESET ALL")
   -> 0.0003s
== 20200409085956 AddPartialIndexOnImportFailuresRetryCount: migrated (0.0108s)

git@f3c2b5e07624:~/gitlab$ bin/rake db:migrate:down VERSION=20200409085956
== 20200409085956 AddPartialIndexOnImportFailuresRetryCount: reverting ========
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:import_failures, [:project_id, :correlation_id_value], {:algorithm=>:concurrently})
   -> 0.0038s
-- execute("SET statement_timeout TO 0")
   -> 0.0003s
-- remove_index(:import_failures, {:algorithm=>:concurrently, :column=>[:project_id, :correlation_id_value]})
   -> 0.0062s
-- execute("RESET ALL")
   -> 0.0003s
== 20200409085956 AddPartialIndexOnImportFailuresRetryCount: reverted (0.0109s)
```
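For reference, a migration producing the output above would look roughly like the following sketch, assuming GitLab's `add_concurrent_index`/`remove_concurrent_index` migration helpers (the real code lives in `db/migrate/20200409085956_add_partial_index_on_import_failures_retry_count.rb`):

```ruby
# frozen_string_literal: true

# Sketch reconstructed from the rake output above, not copied from the diff.
class AddPartialIndexOnImportFailuresRetryCount < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  disable_ddl_transaction!

  def up
    # Partial index: only "hard" failures (retry_count = 0) are indexed.
    add_concurrent_index :import_failures, [:project_id, :correlation_id_value],
                         where: 'retry_count = 0'
  end

  def down
    remove_concurrent_index :import_failures, [:project_id, :correlation_id_value]
  end
end
```

The helper wraps `add_index(..., algorithm: :concurrently)` and temporarily lifts the statement timeout, which is what shows up as the `SET statement_timeout TO 0` / `RESET ALL` lines in the output.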
Query:

```sql
explain
SELECT "import_failures".*
FROM "import_failures"
WHERE "import_failures"."project_id" = <snip>
  AND "import_failures"."correlation_id_value" = <snip>
  AND "import_failures"."retry_count" = 0
ORDER BY "import_failures"."created_at" DESC
LIMIT 100
```
Before index:
```
 Limit  (cost=6.30..6.30 rows=1 width=160) (actual time=2.056..2.056 rows=0 loops=1)
   Buffers: shared hit=3 read=2
   I/O Timings: read=1.834
   ->  Sort  (cost=6.30..6.30 rows=1 width=160) (actual time=2.053..2.055 rows=0 loops=1)
         Sort Key: import_failures.created_at DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=3 read=2
         I/O Timings: read=1.834
         ->  Index Scan using index_import_failures_on_correlation_id_value on public.import_failures  (cost=0.29..6.29 rows=1 width=160) (actual time=1.907..1.907 rows=0 loops=1)
               Index Cond: ((import_failures.correlation_id_value)::text = 'cid'::text)
               Filter: ((import_failures.project_id = <snip>) AND (import_failures.retry_count = 0))
               Rows Removed by Filter: 0
               Buffers: shared read=2
               I/O Timings: read=1.834
```
After index:
```
 Limit  (cost=4.32..4.32 rows=1 width=160) (actual time=0.266..0.266 rows=0 loops=1)
   Buffers: shared hit=6 read=2
   I/O Timings: read=0.169
   ->  Sort  (cost=4.32..4.32 rows=1 width=160) (actual time=0.264..0.264 rows=0 loops=1)
         Sort Key: import_failures.created_at DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=6 read=2
         I/O Timings: read=0.169
         ->  Index Scan using idx_import_failures_on_hard_failures on public.import_failures  (cost=0.29..4.31 rows=1 width=160) (actual time=0.198..0.198 rows=0 loops=1)
               Index Cond: ((import_failures.project_id = <snip>) AND ((import_failures.correlation_id_value)::text = 'cid'::text))
               Buffers: shared hit=3 read=2
               I/O Timings: read=0.169
```
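For context, the Rails-side relation that generates this query would be along these lines; the scope name and the `import_failures` association are assumptions for illustration:

```ruby
# Sketch only: a "hard" failure is a failure row with retry_count = 0.
class ImportFailure < ApplicationRecord
  scope :hard_failures_by_correlation_id, ->(correlation_id) do
    where(correlation_id_value: correlation_id, retry_count: 0)
      .order(created_at: :desc)
  end
end

# Usage mirroring the EXPLAINed query (values elided, as in the plans above):
project.import_failures
       .hard_failures_by_correlation_id(project.import_state.correlation_id)
       .limit(100)
```

The partial index matches exactly this shape: the `Filter` / `Rows Removed by Filter` steps from the "before" plan disappear once `retry_count = 0` is baked into the index predicate.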
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- [-] Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Tested locally against a `puma` container as well.