Break up load_balancing_strategy into more cases
The following discussion from !63304 (merged) should be addressed:
- @nmilojevic1 started a discussion (+5 comments):

  > I think that we should still distinguish the case when we decided to fall back to the primary because the replica was not ready. I would treat this strategy the same for both `delayed` and `sticky`: we decided to fall back to the primary because the replica was not caught up.
  >
  > - For `delayed`, we will retry the first time; the second time we will fall back to the primary.
  > - For `sticky`, we will fall back to the primary.
  >
  > Sounds like the same strategy. But in order to debug more easily, I would separate this from the case when we stick to the primary by default (`:always`), or when the location is not provided. I would like to know that the replica was not ready, and that this was the bare reason why we decided to hit the primary.
We realized that, in order to better measure progress and success in utilizing Sidekiq load balancing, it would be useful to track the worker configuration ("desired state") and the actual load balancing strategy used ("actual state") in more detail.
Our current approach is to track state according to the following table:
| data_consistency | load_balancing_strategy | description |
|---|---|---|
| null | primary | feature flag disabled, or it's not an `ApplicationWorker` |
| :always | primary | default behavior |
| :sticky | primary | location was not provided, or replica was not ready |
| :sticky | replica | replica was ready |
| :delayed | primary | location was not provided, or replica was not ready the second time, after the job was retried |
| :delayed | replica | replica was ready the second time, after the job was retried |
| :delayed | retry_replica | replica was not ready; the job will be retried |
| :delayed | retry_primary | replica was not ready on the 2nd try |
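
For reference, the "desired state" comes from the worker's `data_consistency` declaration. A minimal sketch (`SomeWorker` is a hypothetical example, not an existing worker):

```ruby
# Hypothetical worker showing how the desired data consistency is declared.
class SomeWorker
  include ApplicationWorker

  # One of :always (the default), :sticky, or :delayed.
  data_consistency :delayed

  def perform(record_id)
    # ...
  end
end
```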
There are a few extra complications or missing cases, however:
- primary was selected because the location was not set (an "in between" state where the job was already scheduled but the FF was still off)
- for `delayed` + `replica`, was it caught up immediately or did we have to retry?
- when LB is not enabled, the `data_consistency` is null; let's change this to never be nil but instead use the default of `always`, since that is what actually happens
We should decide which of these are worth tracking and break this field down accordingly.
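
For the last point, a minimal sketch of what the nil-to-default mapping could look like (the `get_data_consistency` reader name is illustrative, not a confirmed API):

```ruby
# Hypothetical helper: report a worker's data consistency for metrics,
# defaulting to :always instead of nil, since running against the primary
# with :always semantics is what actually happens in that case.
def worker_data_consistency(worker_class)
  return :always unless worker_class.respond_to?(:get_data_consistency)

  worker_class.get_data_consistency || :always
end
```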
Suggested breakdown
To account for the above-mentioned gaps, below is an extended mapping of states to field values:
| data_consistency | load_balancing_strategy | description |
|---|---|---|
| :always | primary | LB N/A: data consistency not set or :always, FF disabled, or not an `ApplicationWorker` |
| :sticky | replica | at least one replica was ready |
| :sticky | primary | no replica was ready |
| :sticky | primary-no-wal | WAL location was not provided |
| :delayed | replica | at least one replica was ready on the 1st attempt |
| :delayed | retry | no replica was ready on the 1st attempt; retry the job |
| :delayed | replica-retried | at least one replica was ready on the 2nd attempt |
| :delayed | primary | no replica was ready on the 2nd attempt |
| :delayed | primary-no-wal | WAL location was not provided |
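
For illustration, here is a hypothetical sketch of how a middleware could derive these labels from the observed outcome. The method signature and variable names are assumptions, not GitLab's actual implementation:

```ruby
# Hypothetical mapping from the observed outcome to the extended
# load_balancing_strategy values in the table above.
def load_balancing_strategy(data_consistency, wal_location:, replica_ready:, attempt:)
  data_consistency ||= :always # "data consistency not set" behaves like :always

  return :primary if data_consistency == :always
  return :"primary-no-wal" if wal_location.nil?

  case data_consistency
  when :sticky
    replica_ready ? :replica : :primary
  when :delayed
    if replica_ready
      attempt == 1 ? :replica : :"replica-retried"
    else
      attempt == 1 ? :retry : :primary
    end
  end
end
```

Keeping `:always` and the missing-WAL case as separate early returns mirrors the table: neither depends on replica state, so they never mix with the replica-readiness labels.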