Skip to content

hook: Increase the timeout when casting votes

Patrick Steinhardt requested to merge pks-hook-increase-tx-timeouts into master

In production, we're hitting timeouts from time to time when voting on transactions, where those timeouts typically always happen on secondary nodes which are waiting for the primary to cast its vote. The root cause of this is that the primary node executes logic in hooks which secondary nodes don't, like posting to the /internal/allowed API to check whether a push is allowed and custom hook logic.

Unfortunately, both of those actions may take quite some time to finish: the allowed check scales with repository size as Rails will execute several RPCs to inspect the pushed data, while the custom hook logic is not controlled by us and may potentially run in unbounded time. Those primary-only actions can thus easily bust the 10 seconds we have set up as timeout for transactional voting.

Given that those steps are potentially unbounded, we could just go ahead and remove the timeout altogether. This does feel a bit dangerous though: if a node part of an ongoing transaction is failing, then it wouldn't ever cast its vote and other nodes may potentially be stuck. If we didn't have a timeout, we'd now rely on the overarching RPC to have a timeout set, otherwise we would be stuck forever waiting for quorum to be reached. There are RPCs where having a timeout doesn't really make a lot of sense, and unfortunately the relevant RPCs in the context of hooks are part of that class as git pushes may take a very long time. So we cannot rely on those RPCs having a sensible timeout.

Given that we cannot rely on any timeout being set but still want to eventually abort the transaction to not get stuck forever, this commit instead introduces an arbitrarily chosen high timeout of 5 minutes to the voting logic. This assumes that the combination of access checks and custom hook logic should never exceed these 5 minutes, which feels like a reasonable assumption to make.

In order to assert that it's still possible to set shorter timeouts if the surrounding RPC has one, a new testcase is added which uses a practically immediate timeout on the RPC.

Merge request reports

Loading