Distributed HyperLogLog batch counter
## What does this MR do?
Inspired by the pure SQL HLL distinct count approach described at https://www.sisense.com/blog/hyperloglog-in-pure-sql/, this MR implements a PoC for a distributed HLL batch counter. The baseline idea is as follows:
- Create HLL buckets at DB level for each batch
- Fetch HLL buckets from application in a loop
- Merge HLL buckets at application layer
- Estimate relation size with HLL buckets at application layer
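The steps above can be sketched as a toy model in plain Ruby. This is a hypothetical illustration, not the MR's actual implementation (which builds buckets in SQL at the DB level); only the bucket count of 512 is taken from the description, and all names below are made up:

```ruby
require 'digest'

# Toy HLL model of the steps above (hypothetical sketch, not the MR's code).
M = 512          # number of buckets, as in the MR description
INDEX_BITS = 9   # log2(512) bits of the hash select the bucket

# Step 1: "Create HLL buckets for each batch" - here simulated in Ruby
# instead of SQL. Each bucket keeps the max rank seen for values hashing
# into it.
def batch_buckets(values)
  buckets = Array.new(M, 0)
  values.each do |v|
    hash  = Digest::MD5.hexdigest(v.to_s)[0, 8].to_i(16) # 32-bit hash
    index = hash & (M - 1)          # low 9 bits pick the bucket
    rest  = hash >> INDEX_BITS      # remaining 23 bits feed the rank
    buckets[index] = [buckets[index], rank(rest)].max
  end
  buckets
end

# Rank = 1-based position of the lowest set bit of the remaining hash bits.
def rank(bits)
  return 32 - INDEX_BITS + 1 if bits.zero?
  r = 1
  until bits.odd?
    bits >>= 1
    r += 1
  end
  r
end

# Step 3: "Merge HLL buckets at application layer" - element-wise max.
def merge(a, b)
  a.zip(b).map(&:max)
end

# Step 4: "Estimate relation size" - standard HLL estimator with the
# small-range (linear counting) correction from the original paper.
def estimate(buckets)
  m = buckets.size.to_f
  alpha = 0.7213 / (1 + 1.079 / m)
  raw = alpha * m * m / buckets.sum { |b| 2.0**-b }
  zeros = buckets.count(0)
  (raw <= 2.5 * m && zeros > 0) ? m * Math.log(m / zeros) : raw
end

# Step 2's fetch loop collapses here to merging two in-memory "batches";
# the overlapping ranges cover 10_000 distinct values in total.
merged = merge(batch_buckets(1..5_000), batch_buckets(2_501..10_000))
puts estimate(merged).round   # roughly 10_000, within HLL error bounds
```

Note that each batch contributes only a fixed-size array of 512 small integers, regardless of batch size, which is what makes the application-side fetch and merge cheap.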
Main gains from the distributed approach:
- A fixed upper limit on the number of records included in each analyzed batch, which ensures a constant maximal execution time for each batch of a given size (see the discussion at #230438 (comment 421212723) for more detail)
- A constant, small amount of data fetched by the application (512 rows with 2 integer attributes each)
- The ability to merge different HLL buckets later on (unions and intersections between different SQL metrics)
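As an illustration of that last point, once per-metric buckets exist, an intersection can be derived from three estimates by inclusion-exclusion (the numbers below are purely illustrative, not real metric values):

```ruby
# Hypothetical HLL estimates for two metrics and for their union
# (the union estimate would come from the element-wise max of the
# two metrics' bucket arrays).
est_a     = 2_601_007   # distinct count estimate for metric A
est_b     = 1_500_000   # distinct count estimate for metric B
est_union = 3_400_000   # estimate from A's and B's merged buckets

# Inclusion-exclusion: |A and B| ~= |A| + |B| - |A or B|
est_intersection = est_a + est_b - est_union
puts est_intersection   # => 701007
```

A plain SQL `UNION` distinct count yields only a single scalar, so this kind of reuse is exactly what the bucket representation buys.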
## Why this MR is required
This MR was motivated by the following:
1. (Not usage ping specific): `BatchCount` does not handle distinct counting well on non-unique values for big relations. I described this problem in detail earlier at !45673 (comment 444843665). While I spotted it for some usage ping metrics, it is not limited to them and can occur wherever `BatchCount` is used, when certain pre-existing conditions are met.
2. (Usage ping specific): Calculating unions and intersections of usage ping metrics. We are currently working on reporting how many distinct entries were recorded across a selected subset of metrics, or how many distinct entries were present in all of the listed metrics (!46146 (merged)). Although for purely SQL-counted metrics it would be possible to use `UNION` and then run a distinct count on that relation, we also have metrics recorded through the Redis HLL algorithm (https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/usage_data_counters/hll_redis_counter.rb). Because of that, it is impossible to use a SQL `UNION` to obtain unions and intersections across multiple data sources.

In the long run, we would need to replace the Redis HLL algorithm with an internal HLL implementation, so that every data source produces compatible HLL buckets from which unions and intersections can be calculated. This implementation is a first step towards that goal. Another advantage this implementation has over the `UNION` version is that, once built, HLL buckets can be reused to obtain multiple unions and intersections, while the resulting value of a `UNION` distinct count cannot.
Alternative approaches were evaluated at #270444 (closed).
## Benchmark
I compared the existing `Gitlab::Database::BatchCount#batch_distinct_count` with the newly proposed counter using the following benchmark, on a relation (`Ci::Build`) that contains 20,010,877 records in total, 2,570,000 of which fulfill the filter condition.
```ruby
require 'benchmark/ips'

Benchmark.ips do |x|
  x.report('hll distributed batch counting') do
    Gitlab::Database::PostgresHllBatchDistinctCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
  end

  x.report('batch counting') do
    Gitlab::Database::BatchCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
  end

  x.compare!
end
```
```
Calculating -------------------------------------
hll distributed batch counting
                         1.000  i/100ms
      batch counting     1.000  i/100ms
-------------------------------------------------
hll distributed batch counting
                          0.040  (± 0.0%) i/s -      1.000  in  25.089552s
      batch counting      0.039  (± 0.0%) i/s -      1.000  in  25.643170s

Comparison:
hll distributed batch counting:        0.0 i/s
      batch counting:        0.0 i/s - 1.02x  slower

=> #<Benchmark::IPS::Report:0x00007f88788e6ce8
 @data=nil,
 @entries=
  [#<Benchmark::IPS::Report::Entry:0x00007f886cc86698 @ips=0.03985722871678709, @ips_sd=0, @iterations=1, @label="hll distributed batch counting", @measurement_cycle=1, @microseconds=25089551.6872406, @show_total_time=true>,
   #<Benchmark::IPS::Report::Entry:0x00007f887b247ab0 @ips=0.03899673888536357, @ips_sd=0, @iterations=1, @label="batch counting", @measurement_cycle=1, @microseconds=25643169.87991333, @show_total_time=true>]>
```
The results returned by each counter were as follows:
```
[9] pry(main)> Gitlab::Database::PostgresHllBatchDistinctCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
=> 2601007
[10] pry(main)> Gitlab::Database::BatchCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
=> 2570000
```
The HyperLogLog algorithm's estimation was off by ~1.2%.
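That figure can be reproduced from the two results above:

```ruby
hll_estimate = 2_601_007   # PostgresHllBatchDistinctCount result
exact_count  = 2_570_000   # BatchCount result

# Relative error of the HLL estimate against the exact distinct count
error_pct = (hll_estimate / exact_count.to_f - 1) * 100
puts error_pct.round(2)   # => 1.21
```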
## Accuracy estimation
| Size | HyperLogLog | Exact count | Error Rate % |
|---|---|---|---|
| 10 | 10.487024297669207 | 10 | 4.870242976692069 |
| 100 | 105.30880048123788 | 100 | 5.308800481237881 |
| 1000 | 1007.2682972730778 | 1000 | 0.7268297273077735 |
| 10000 | 9689 | 10000 | -3.1099999999999994 |
| 100000 | 102621 | 100000 | 2.6210000000000093 |
| 1000000 | 996113 | 1000000 | -0.38870000000000005 |
| 10000000 | 10070622 | 10000000 | 0.7062200000000018 |
Accuracy estimation was done with the following snippet:
```ruby
require './spec/support/sidekiq_middleware'

class Gitlab::Seeder::Users
  include ActionView::Helpers::NumberHelper

  MAX_BATCH_SIZE = 100_000

  attr_reader :opts

  def initialize(mass_users_count)
    @mass_users_count = mass_users_count
  end

  def seed!
    Sidekiq::Testing.inline! do
      create_mass_users!
    end
  end

  private

  def create_mass_users!
    encrypted_password = Devise::Encryptor.digest(User, random_password)

    batch_size = @mass_users_count > MAX_BATCH_SIZE ? MAX_BATCH_SIZE : @mass_users_count
    batch_start = 1

    while batch_start < @mass_users_count do
      ActiveRecord::Base.connection.execute <<~SQL
        INSERT INTO users (username, name, email, confirmed_at, projects_limit, encrypted_password)
        SELECT
          '#{Gitlab::Seeder::MASS_INSERT_USER_START}' || seq,
          'Seed user ' || seq,
          'seed_user' || seq || '@example.com',
          to_timestamp(seq),
          #{@mass_users_count},
          '#{encrypted_password}'
        FROM generate_series(#{batch_start}, #{batch_start + batch_size - 1}) AS seq
      SQL

      batch_start += batch_size
    end
  end

  def random_password
    @random_password ||= SecureRandom.hex.slice(0, 16)
  end
end

vals = []

[10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].each do |size|
  Gitlab::Seeder::Users.new(size).seed!

  relation = User
  hll_est = Gitlab::Database::PostgresHllBatchDistinctCount.batch_distinct_count(relation, :email)
  exact_count = size
  vals << [size, hll_est, exact_count]

  ActiveRecord::Base.connection.execute('TRUNCATE users CASCADE')
end

template = "| %{size} | %{hll} | %{exact} | %{err} |\n"
print template % { size: 'Size', hll: 'HyperLogLog', exact: 'Exact count', err: 'Error Rate' }

vals.each do |size, hll_est, exact_count|
  print template % { size: size, hll: hll_est, exact: exact_count, err: (hll_est / exact_count.to_f) * 100 - 100 }
end
```
## Does this MR meet the acceptance criteria?
### Conformity
- [-] Changelog entry
- [-] Documentation (if required)
- [ ] Code review guidelines
- [ ] Merge request performance guidelines
- [ ] Style guides
- [ ] Database guides
- [ ] Separation of EE specific content
### Availability and Testing
- [ ] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [ ] Tested in all supported browsers
- [ ] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
### Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

- [ ] Label as security and @ mention `@gitlab-com/gl-security/appsec`
- [ ] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [ ] Security reports checked/validated by a reviewer from the AppSec team