Add columns and indices on web hooks [RUN ALL RSPEC] [RUN AS-IF-FOSS]
What does this MR do?
This MR makes changes to the backend representation of webhooks so that we can mark repeatedly failing webhooks as failing, and avoid executing them in the future.
Follow-up work will surface this information in the UI, and allow the web-hooks to be reset.
see: #329213
Backoff Strategy
See this comment for an explanation of the logic behind the back-off strategy: https://gitlab.com/gitlab-org/gitlab/-/issues/329213#note_564836459
The backoff strategy implemented is:
- regular
4xx
failures: 3 strikes policy. Once we have seen 4 failures then we consider the hook disabled, and never try it again -
5xx
failures, and more low level stuff (network errors, etc): exponential backoff.
The cooldown for 500
errors starts at 10 minutes, and is triggered on the first
internal server error. The backoff factor is 2.0
. The backoff never exceeds 24 hours.
The reasoning is that 4xx
indicates that the webhook was probably set up incorrectly, but
5xx
suggests an unhealthy service, one which we may just need to give some cool-down time.
429 TOO MANY REQUESTS
is a possible exception - we might want to handle this differently.
Manual QA
To play around with this feature, you need to enable the feature-flag:
Feature.enable(:web_hooks_disable_failed)
Then we can verify that this work by:
- creating a web hook to an endpoint (I like to use https://webhook.site)
- verify that the endpoint receives data
- modify the endpoint to return failing statuses (e.g.
400
) - make at least 4 changes - check that the endpoint receives 4 messages
- verify that the endpoint stops receiving data
We can then double check in the console:
p = Project.find_by_full_path('some-path/to-project')
hook = ProjectHook.where(project: p).first
hook.executable?
=> false
hook.web_hook_logs.pluck(:response_status)
=> [... "400", "400", "400", "400"]
We can now restore the hook. First in the console:
p = Project.find_by_full_path('some-path/to-project')
hook = ProjectHook.where(project: p).first
hook.enable!
hook.executable?
=> true
Now make any further change, and verify that the endpoint receives a new message.
Migration output:
Up
== 20210429181325 AddFailureTrackingToWebHooks: migrating =====================
-- change_table(:web_hooks, {:bulk=>true})
-> 0.0052s
== 20210429181325 AddFailureTrackingToWebHooks: migrated (0.0053s) ============
== 20210504085144 AddIndexOnWebHookProjectIdRecentFailures: migrating =========
-- transaction_open?()
-> 0.0000s
-- index_exists?(:web_hooks, [:project_id, :recent_failures], {:name=>"index_web_hooks_on_project_id_recent_failures", :algorithm=>:concurrently})
-> 0.0074s
-- add_index(:web_hooks, [:project_id, :recent_failures], {:name=>"index_web_hooks_on_project_id_recent_failures", :algorithm=>:concurrently})
-> 0.0186s
== 20210504085144 AddIndexOnWebHookProjectIdRecentFailures: migrated (0.0283s)
Down
== 20210504085144 AddIndexOnWebHookProjectIdRecentFailures: reverting =========
-- transaction_open?()
-> 0.0000s
-- indexes(:web_hooks)
-> 0.0039s
-- remove_index(:web_hooks, {:algorithm=>:concurrently, :name=>"index_web_hooks_on_project_id_recent_failures"})
-> 0.0028s
== 20210504085144 AddIndexOnWebHookProjectIdRecentFailures: reverted (0.0078s)
== 20210429181325 AddFailureTrackingToWebHooks: reverting =====================
-- remove_column(:web_hooks, :disabled_until, :timestamptz, {})
-> 0.0029s
-- remove_column(:web_hooks, :backoff_count, :integer, {:null=>false, :limit=>2, :default=>0})
-> 0.0013s
-- remove_column(:web_hooks, :recent_failures, :integer, {:null=>false, :limit=>2, :default=>0})
-> 0.0015s
== 20210429181325 AddFailureTrackingToWebHooks: reverted (0.0071s) ============
### Changes to queries
This MR changes select_active
to add a new where clause:
SELECT "web_hooks".* FROM "web_hooks"
WHERE "web_hooks"."project_id" = 278964
AND (recent_failures <= 3
AND (disabled_until IS NULL OR disabled_until < '2021-05-04 06:23:39.829460'))
Plan:
Index Scan using web_hooks_project_id_recent_failures_disabled_until_idx on public.web_hooks (cost=0.42..3.45 rows=1 width=282) (actual time=0.151..0.170 rows=2 loops=1)
Index Cond: ((web_hooks.project_id = 278964) AND (web_hooks.recent_failures <= 3))
Filter: ((web_hooks.disabled_until IS NULL) OR (web_hooks.disabled_until < '2021-05-04 06:23:39.82946+00'::timestamp with time zone))
Rows Removed by Filter: 0
Buffers: shared hit=2 read=3
I/O Timings: read=0.078
Timings:
Time: 2.317 ms
- planning: 2.109 ms
- execution: 0.208 ms
- I/O read: 0.078 ms
- I/O write: N/A
Shared buffers:
- hits: 2 (~16.00 KiB) from the buffer pool
- reads: 3 (~24.00 KiB) from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
This MR changes the query in Group.execute_hooks
to include the executable scope:
SQL:
SELECT "web_hooks".* FROM "web_hooks"
WHERE "web_hooks"."type" = 'GroupHook'
AND "web_hooks"."group_id" IN (SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."type" = 'Group' AND "namespaces"."id" = 22)
AND (recent_failures <= 3 AND (disabled_until IS NULL OR disabled_until < '2021-05-04 10:27:08.292342'))
Plan:
Nested Loop Semi Join (cost=0.71..6.77 rows=1 width=282) (actual time=2.163..2.164 rows=0 loops=1)
Buffers: shared read=2
I/O Timings: read=2.115
-> Index Scan using index_web_hooks_on_group_id on public.web_hooks (cost=0.28..3.31 rows=1 width=282) (actual time=2.162..2.162 rows=0 loops=1)
Index Cond: (web_hooks.group_id = 22)
Filter: ((web_hooks.recent_failures <= 3) AND ((web_hooks.disabled_until IS NULL) OR (web_hooks.disabled_until < '2021-05-04 10:27:08.292342+00'::timestamp with time zone)))
Rows Removed by Filter: 0
Buffers: shared read=2
I/O Timings: read=2.115
-> Index Only Scan using index_namespaces_on_type_and_id_partial on public.namespaces (cost=0.43..3.45 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=0)
Index Cond: ((namespaces.type = 'Group'::text) AND (namespaces.id = 22))
Heap Fetches: 0
Timings:
Time: 5.736 ms
- planning: 3.533 ms
- execution: 2.203 ms
- I/O read: 2.115 ms
- I/O write: N/A
Shared buffers:
- hits: 0 from the buffer pool
- reads: 2 (~16.00 KiB) from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
Does this MR meet the acceptance criteria?
Conformity
-
📋 Does this MR need a changelog?-
I have included a changelog entry.
-
-
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
The main risks of this change is that we may disable a user's web hooks in a way that prevents them from doing anything to remediate the situation. For this reason, we should not enable the feature flag until UI elements have been introduced.
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.