Distributed HyperLogLog batch counter
## What does this MR do?
Inspired by the pure SQL HLL distinct count approach described at https://www.sisense.com/blog/hyperloglog-in-pure-sql/, this MR implements a PoC for a distributed HLL batch counter. The baseline idea is as follows:
- Create HLL buckets at DB level for each batch
- Fetch HLL buckets from application in a loop
- Merge HLL buckets at application layer
- Estimate relation size with HLL buckets at application layer
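The steps above can be sketched as a toy model in plain Ruby. This is a hypothetical illustration, not the MR's actual implementation (which builds buckets in SQL at the DB level); only the bucket count of 512 is taken from the description, and all names below are made up:

```ruby
require 'digest'

# Toy HLL model of the steps above (hypothetical sketch, not the MR's code).
M = 512          # number of buckets, as in the MR description
INDEX_BITS = 9   # log2(512) bits of the hash select the bucket

# Step 1: "Create HLL buckets for each batch" - here simulated in Ruby
# instead of SQL. Each bucket keeps the max rank seen for values hashing
# into it.
def batch_buckets(values)
  buckets = Array.new(M, 0)
  values.each do |v|
    hash  = Digest::MD5.hexdigest(v.to_s)[0, 8].to_i(16) # 32-bit hash
    index = hash & (M - 1)          # low 9 bits pick the bucket
    rest  = hash >> INDEX_BITS      # remaining 23 bits feed the rank
    buckets[index] = [buckets[index], rank(rest)].max
  end
  buckets
end

# Rank = 1-based position of the lowest set bit of the remaining hash bits.
def rank(bits)
  return 32 - INDEX_BITS + 1 if bits.zero?
  r = 1
  until bits.odd?
    bits >>= 1
    r += 1
  end
  r
end

# Step 3: "Merge HLL buckets at application layer" - element-wise max.
def merge(a, b)
  a.zip(b).map(&:max)
end

# Step 4: "Estimate relation size" - standard HLL estimator with the
# small-range (linear counting) correction from the original paper.
def estimate(buckets)
  m = buckets.size.to_f
  alpha = 0.7213 / (1 + 1.079 / m)
  raw = alpha * m * m / buckets.sum { |b| 2.0**-b }
  zeros = buckets.count(0)
  (raw <= 2.5 * m && zeros > 0) ? m * Math.log(m / zeros) : raw
end

# Step 2's fetch loop collapses here to merging two in-memory "batches";
# the overlapping ranges cover 10_000 distinct values in total.
merged = merge(batch_buckets(1..5_000), batch_buckets(2_501..10_000))
puts estimate(merged).round   # roughly 10_000, within HLL error bounds
```

Note that each batch contributes only a fixed-size array of 512 small integers, regardless of batch size, which is what makes the application-side fetch and merge cheap.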
Main gains from the distributed approach:
- A fixed upper limit on the number of records included in each analyzed batch, which ensures a constant maximal execution time for each batch of a given size (see the discussion at #230438 (comment 421212723) for more detail)
- A constant, small amount of data fetched by the application (512 rows with 2 integer attributes each)
- The ability to merge different HLL buckets later on (unions and intersections between different SQL metrics)
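As an illustration of that last point, once per-metric buckets exist, an intersection can be derived from three estimates by inclusion-exclusion (the numbers below are purely illustrative, not real metric values):

```ruby
# Hypothetical HLL estimates for two metrics and for their union
# (the union estimate would come from the element-wise max of the
# two metrics' bucket arrays).
est_a     = 2_601_007   # distinct count estimate for metric A
est_b     = 1_500_000   # distinct count estimate for metric B
est_union = 3_400_000   # estimate from A's and B's merged buckets

# Inclusion-exclusion: |A and B| ~= |A| + |B| - |A or B|
est_intersection = est_a + est_b - est_union
puts est_intersection   # => 701007
```

A plain SQL `UNION` distinct count yields only a single scalar, so this kind of reuse is exactly what the bucket representation buys.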
## Why this MR is required
This MR was motivated by the following:
1. (Not usage ping specific): `BatchCount` does not handle distinct counting well on non-unique values for big relations. I described this problem in detail earlier at !45673 (comment 444843665). While I spotted it for some usage ping metrics, it is not limited to them and can occur wherever `BatchCount` is used, when certain pre-existing conditions are met.
2. (Usage ping specific): Calculating unions and intersections of usage ping metrics. We are currently working on reporting how many distinct entries were recorded across a selected subset of metrics, or how many distinct entries were present in all of the listed metrics (!46146 (merged)). Although for purely SQL-counted metrics it would be possible to use `UNION` and then run a distinct count on that relation, we also have metrics recorded through the Redis HLL algorithm (https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/usage_data_counters/hll_redis_counter.rb). Because of that, it is impossible to use a SQL `UNION` to obtain unions and intersections across multiple data sources.

In the long run, we would need to replace the Redis HLL algorithm with an internal HLL implementation, so that every data source produces compatible HLL buckets from which unions and intersections can be calculated. This implementation is a first step towards that goal. Another advantage this implementation has over the `UNION` version is that, once built, HLL buckets can be reused to obtain multiple unions and intersections, while the resulting value of a `UNION` distinct count cannot.
Alternative approaches were evaluated at #270444 (closed).
## Benchmark
I compared the existing `Gitlab::Database::BatchCount#batch_distinct_count` with the newly proposed counter using the following benchmark, on a relation (`Ci::Build`) that contains 20,010,877 records in total, 2,570,000 of which fulfill the filter condition.
```ruby
require 'benchmark/ips'

Benchmark.ips do |x|
  x.report('hll distributed batch counting') do
    Gitlab::Database::PostgresHllBatchDistinctCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
  end

  x.report('batch counting') do
    Gitlab::Database::BatchCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
  end

  x.compare!
end
```
```
Calculating -------------------------------------
hll distributed batch counting
                         1.000  i/100ms
      batch counting     1.000  i/100ms
-------------------------------------------------
hll distributed batch counting
                          0.040  (± 0.0%) i/s -      1.000  in  25.089552s
      batch counting      0.039  (± 0.0%) i/s -      1.000  in  25.643170s

Comparison:
hll distributed batch counting:        0.0 i/s
      batch counting:        0.0 i/s - 1.02x  slower

=> #<Benchmark::IPS::Report:0x00007f88788e6ce8
 @data=nil,
 @entries=
  [#<Benchmark::IPS::Report::Entry:0x00007f886cc86698 @ips=0.03985722871678709, @ips_sd=0, @iterations=1, @label="hll distributed batch counting", @measurement_cycle=1, @microseconds=25089551.6872406, @show_total_time=true>,
   #<Benchmark::IPS::Report::Entry:0x00007f887b247ab0 @ips=0.03899673888536357, @ips_sd=0, @iterations=1, @label="batch counting", @measurement_cycle=1, @microseconds=25643169.87991333, @show_total_time=true>]>
```
The results returned by each counter were as follows:
```
[9] pry(main)> Gitlab::Database::PostgresHllBatchDistinctCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
=> 2601007
[10] pry(main)> Gitlab::Database::BatchCount.batch_distinct_count(::Ci::Build.where(name: 'dast'), :user_id)
=> 2570000
```
The HyperLogLog algorithm's estimation was off by ~1.2%.
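That figure can be reproduced from the two results above:

```ruby
hll_estimate = 2_601_007   # PostgresHllBatchDistinctCount result
exact_count  = 2_570_000   # BatchCount result

# Relative error of the HLL estimate against the exact distinct count
error_pct = (hll_estimate / exact_count.to_f - 1) * 100
puts error_pct.round(2)   # => 1.21
```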
## Accuracy estimation
| Size | HyperLogLog | Exact count | Error Rate % |
|---|---|---|---|
| 10 | 10.487024297669207 | 10 | 4.870242976692069 |
| 100 | 105.30880048123788 | 100 | 5.308800481237881 |
| 1000 | 1007.2682972730778 | 1000 | 0.7268297273077735 |
| 10000 | 9689 | 10000 | -3.1099999999999994 |
| 100000 | 102621 | 100000 | 2.6210000000000093 |
| 1000000 | 996113 | 1000000 | -0.38870000000000005 |
| 10000000 | 10070622 | 10000000 | 0.7062200000000018 |
Accuracy estimation was done with the following snippet:
```ruby
require './spec/support/sidekiq_middleware'

class Gitlab::Seeder::Users
  include ActionView::Helpers::NumberHelper

  MAX_BATCH_SIZE = 100_000

  attr_reader :opts

  def initialize(mass_users_count)
    @mass_users_count = mass_users_count
  end

  def seed!
    Sidekiq::Testing.inline! do
      create_mass_users!
    end
  end

  private

  def create_mass_users!
    encrypted_password = Devise::Encryptor.digest(User, random_password)

    batch_size = @mass_users_count > MAX_BATCH_SIZE ? MAX_BATCH_SIZE : @mass_users_count
    batch_start = 1

    while batch_start < @mass_users_count do
      ActiveRecord::Base.connection.execute <<~SQL
        INSERT INTO users (username, name, email, confirmed_at, projects_limit, encrypted_password)
        SELECT
          '#{Gitlab::Seeder::MASS_INSERT_USER_START}' || seq,
          'Seed user ' || seq,
          'seed_user' || seq || '@example.com',
          to_timestamp(seq),
          #{@mass_users_count},
          '#{encrypted_password}'
        FROM generate_series(#{batch_start}, #{batch_start + batch_size - 1}) AS seq
      SQL

      batch_start += batch_size
    end
  end

  def random_password
    @random_password ||= SecureRandom.hex.slice(0, 16)
  end
end

vals = []

[10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].each do |size|
  Gitlab::Seeder::Users.new(size).seed!

  relation = User
  hll_est = Gitlab::Database::PostgresHllBatchDistinctCount.batch_distinct_count(relation, :email)
  exact_count = size
  vals << [size, hll_est, exact_count]

  ActiveRecord::Base.connection.execute('TRUNCATE users CASCADE')
end

template = "| %{size} | %{hll} | %{exact} | %{err} |\n"
print template % { size: 'Size', hll: 'HyperLogLog', exact: 'Exact count', err: 'Error Rate' }

vals.each do |size, hll_est, exact_count|
  print template % { size: size, hll: hll_est, exact: exact_count, err: (hll_est / exact_count.to_f) * 100 - 100 }
end
```
## Does this MR meet the acceptance criteria?
### Conformity
- [-] Changelog entry
- [-] Documentation (if required)
- [ ] Code review guidelines
- [ ] Merge request performance guidelines
- [ ] Style guides
- [ ] Database guides
- [ ] Separation of EE specific content
### Availability and Testing
- [ ] Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [ ] Tested in all supported browsers
- [ ] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
### Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

- [ ] Label as security and @ mention `@gitlab-com/gl-security/appsec`
- [ ] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [ ] Security reports checked/validated by a reviewer from the AppSec team