Differentiate metrics and logs from replica/primary databases
What does this MR do?
Closes #323164 (closed) Closes #323165 (closed)
In summary, our instrumentation infrastructure unifies all database accesses into the same metrics. In many cases, especially in some recent incidents, it would be extremely useful to know the source of each query, the replica utilization, and the time spent in each database for a particular request. In fact, in my local development environment, it turns out that some read-only, safely cacheable queries are executed on the primary database. Therefore, it is useful to have that information available in the performance bar as well. This MR differentiates the metrics and logs between the replica and primary database roles. In detail:
- Introduce db_<role>_count, db_<role>_cached_count, and db_<role>_duration_s in web logs and Sidekiq structured logs
- Introduce gitlab_transaction_db_<role>_count_total prometheus counter
- Introduce gitlab_sql_<role>_duration_seconds prometheus histogram
- Add replica, primary tags into Active Record performance bar
All of these features are enabled only if database load balancing is enabled. If it is not, no further information is added. This means the change barely affects self-managed instances.
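As a rough illustration of the naming scheme above, the following Ruby sketch builds the per-role log fields and returns nothing when load balancing is off. The helper name role_log_fields, its keyword argument, and the shape of the counters hash are hypothetical and are not taken from the GitLab codebase.

```ruby
# Hypothetical helper: builds the db_<role>_* log fields described above.
ROLES = %i[primary replica].freeze

def role_log_fields(counters, load_balancing_enabled:)
  # Without load balancing, no role-specific fields are added.
  return {} unless load_balancing_enabled

  ROLES.each_with_object({}) do |role, fields|
    stats = counters.fetch(role, {})
    fields[:"db_#{role}_count"] = stats.fetch(:count, 0)
    fields[:"db_#{role}_cached_count"] = stats.fetch(:cached_count, 0)
    fields[:"db_#{role}_duration_s"] = stats.fetch(:duration_s, 0.0).round(3)
  end
end

# Made-up sample numbers, for illustration only.
counters = {
  primary: { count: 2, cached_count: 0, duration_s: 0.0123 },
  replica: { count: 5, cached_count: 1, duration_s: 0.0307 }
}

fields = role_log_fields(counters, load_balancing_enabled: true)
disabled_fields = role_log_fields(counters, load_balancing_enabled: false)
```

With load balancing enabled, fields contains six keys (three per role); with it disabled, the helper returns an empty hash so existing log lines are unchanged.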
Solution
Extracted from gitlab-com/gl-infra/scalability#873 (comment 511091812)
When load balancing is enabled, ActiveRecord::Base is patched so that ActiveRecord::Base.connection returns a Gitlab::Database::LoadBalancing::ConnectionProxy instance wrapping PostgreSQLAdapter. This proxy redirects read and write statements to the corresponding connections (primary/replica). Luckily, ConnectionProxy only patches high-level statements and falls back to the original ActiveRecord connection for all other method calls. This means that the connection field in the instrumentation event payload is guaranteed to be the connection after replica redirection.
The full pipeline looks like this:
User.find(1)
-> call ActiveRecord::Base.connection
-> return an instance of Gitlab::Database::LoadBalancing::ConnectionProxy
-> call Gitlab::Database::LoadBalancing::ConnectionProxy#select
-> call Gitlab::Database::LoadBalancing::LoadBalancer#read or #write
-> return an ActiveRecord::ConnectionAdapters::PostgreSQLAdapter object from primary or replica host.
-> call ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#select
-> call ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_no_cache and friends
-> Broadcast the instrumentation event
-> Listeners capture and accumulate the events
If an accessor is not covered by the proxy, for example ActiveRecord::Base.connection.query('select 1'), the proxy falls back to the connection object and the flow stays the same.
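The delegation described above can be sketched in plain Ruby. FakeAdapter and FakeConnectionProxy are illustrative stand-ins, not the real Rails or GitLab classes, and routing uncovered methods to the primary is an assumption made for this sketch only. The point is that the event always carries the raw adapter, whichever path the call takes.

```ruby
# Illustrative stand-ins; these are not the real GitLab or Rails classes.
class FakeAdapter
  attr_reader :host

  def initialize(host, events)
    @host = host
    @events = events
  end

  # Every statement publishes an event that carries the raw adapter.
  def select(sql)
    @events << { sql: sql, connection: self }
    []
  end

  def query(sql)
    @events << { sql: sql, connection: self }
    []
  end
end

class FakeConnectionProxy
  def initialize(primary:, replica:)
    @primary = primary
    @replica = replica
  end

  # A high-level read statement is redirected to the replica.
  def select(sql)
    @replica.select(sql)
  end

  # Anything the proxy does not cover falls back to a real connection
  # (the primary here, which is an assumption of this sketch).
  def method_missing(name, *args, &block)
    return super unless @primary.respond_to?(name)

    @primary.public_send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    @primary.respond_to?(name, include_private) || super
  end
end

events = []
primary = FakeAdapter.new('primary-host', events)
replica = FakeAdapter.new('replica-host', events)
proxy = FakeConnectionProxy.new(primary: primary, replica: replica)

proxy.select('SELECT * FROM users WHERE id = 1') # covered by the proxy
proxy.query('select 1')                          # not covered, falls back
```

In both calls the connection recorded in the event is an adapter object, never the proxy, which is what makes classification by role possible later.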
Because the broadcast connection objects are proven to be the raw connection objects, we can classify the events with confidence. In detail, most of the work is modifying Gitlab::Metrics::Subscribers::ActiveRecord:
- Store the role (primary/replica) of each connection after it is retrieved from the load balancer.
- When listening to the sql.active_record event, the metric subscriber calls the global Gitlab::Database::LoadBalancing#db_role method to classify the received connection.
- Broadcast metrics and accumulate logs for use in lograge and structured logs.
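The three steps above can be sketched without Rails as follows. RoleRegistry and RoleSubscriber are hypothetical names, the payload hash only mimics the connection and cached fields of an sql.active_record event, and the identity-keyed hash is one possible way to remember roles, not necessarily how this MR does it.

```ruby
# Hypothetical sketch of role tagging and per-role accumulation.
class RoleRegistry
  def initialize
    # Compare keys by object identity so each connection is tracked separately.
    @roles = {}.compare_by_identity
  end

  # Step 1: remember the role when the connection leaves the load balancer.
  def tag(connection, role)
    @roles[connection] = role
    connection
  end

  # Step 2: look up the role of the connection found in the event payload.
  def db_role(connection)
    @roles.fetch(connection, :unknown)
  end
end

class RoleSubscriber
  attr_reader :counters

  def initialize(registry)
    @registry = registry
    @counters = Hash.new do |hash, role|
      hash[role] = { count: 0, cached_count: 0, duration_s: 0.0 }
    end
  end

  # Step 3: accumulate per-role numbers for metrics and structured logs.
  def sql(payload, duration_s)
    stats = @counters[@registry.db_role(payload[:connection])]
    stats[:count] += 1
    stats[:cached_count] += 1 if payload[:cached]
    stats[:duration_s] += duration_s
  end
end

registry = RoleRegistry.new
primary = registry.tag(Object.new, :primary)
replica = registry.tag(Object.new, :replica)

subscriber = RoleSubscriber.new(registry)
subscriber.sql({ sql: 'SELECT 1', connection: replica }, 0.002)
subscriber.sql({ sql: 'SELECT 1', connection: replica, cached: true }, 0.0)
subscriber.sql({ sql: 'UPDATE users SET name = $1', connection: primary }, 0.004)
```

After these three events, subscriber.counters holds separate count, cached_count, and duration_s totals for the replica and the primary, which map directly onto the db_<role>_* log fields.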
Screenshots (strongly suggested)
New log items in Web JSON logs
New log items in API JSON logs
New tags in the Active Record section in the performance bar
New prometheus metrics
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- Style guides
- Database guides
- Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- Tested in all supported browsers
- Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- Label as security and @ mention @gitlab-com/gl-security/appsec
- The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- Security reports checked/validated by a reviewer from the AppSec team