Pseudonymize GitLab Standard Context for Snowplow

Goal

To understand the behavior patterns of individual users without connecting them to any natural person using GitLab, we need to pseudonymize user_id before it is recorded anywhere.

What should be pseudonymized

Reference issue: #336779 (closed)

Field	Needs pseudonymization	Reasoning	Decision maker
user_id	Yes	This is an indirect identifier which can be used to reveal directly identifiable data. With `user_id` anyone can access the `name` and `username` of both public and private profiles.	@amandarueda

Technical implementation

Let's use an actor-based feature flag to control the feature rollout.
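A minimal sketch of what actor-based gating could look like. ActorFlag, the flag name pseudonymize_user_id, and tracked_user_id are hypothetical stand-ins written for illustration; this is not GitLab's actual feature flag API, only a self-contained model of the idea that the same actor always gets the same rollout decision.

```ruby
require 'zlib'

# Hypothetical stand-in for a feature-flag backend (assumption, not GitLab's Feature class).
class ActorFlag
  def initialize(name, percentage)
    @name = name
    @percentage = percentage # share of actors (0..100) that get the feature
  end

  # Deterministic per actor: the same actor_id always yields the same answer
  # for a given flag name and percentage.
  def enabled_for?(actor_id)
    Zlib.crc32("#{@name}:#{actor_id}") % 100 < @percentage
  end
end

# Hypothetical call site: only pseudonymize for actors covered by the rollout.
# The string below is a placeholder for the real pseudonymization step.
def tracked_user_id(flag, user_id)
  flag.enabled_for?(user_id) ? "pseudonymized(#{user_id})" : user_id
end

flag = ActorFlag.new(:pseudonymize_user_id, 25)
puts tracked_user_id(flag, 123)
```

Keying the decision on the actor rather than on a random draw means a given user's events are either all pseudonymized or all not while the rollout percentage stays the same.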

The following pseudocode demonstrates a simplified encryption process:

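require 'openssl'

key_size_bits = 2048
pkey = OpenSSL::PKey::RSA.new(key_size_bits)
payload = 123.to_s # user_id serialized as a string

# With NO_PADDING the input must be exactly as long as the key modulus.
padding_size = key_size_bits / 8 - payload.bytesize

if padding_size >= 0
  # Fixed, non-random padding keeps the output deterministic for a given input.
  padding = 'a' * padding_size
  pkey.public_encrypt("#{padding}#{payload}", OpenSSL::PKey::RSA::NO_PADDING)
else
  raise StandardError, 'payload was too long for encryption with the given key'
end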

Things that should be decided:

Size of the key used, since it dictates how long a payload can be encrypted. Alternatively, we can consider hashing the data first and then encrypting the hash. Because a hash has a fixed length, that would avoid problems with varying payload length.
Where and how the key file should be stored and then made available to the encryption service.
Whether reading the key via disk I/O would be sufficient performance-wise, or whether we should cache the key or go back to the collector layer concept.
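The hash-then-encrypt alternative and in-memory reuse of the key could be sketched roughly as follows. The Pseudonymizer class, its method names, and the choice of SHA-256 are assumptions made for illustration and have not been decided in this issue. The PEM is parsed once in the constructor, so repeated calls do not touch the disk.

```ruby
require 'openssl'

# Hypothetical helper (assumption): hashes the value first, then encrypts the
# fixed-length digest with a constant padding so the output is deterministic.
class Pseudonymizer
  # public_key_pem: PEM string, e.g. read once from the key file at boot.
  def initialize(public_key_pem)
    @public_key = OpenSSL::PKey::RSA.new(public_key_pem) # parsed once, reused
    @block_size = @public_key.n.num_bytes                # 256 for a 2048-bit key
  end

  def pseudonymize(value)
    # SHA-256 always yields 32 bytes, regardless of the input length.
    digest = OpenSSL::Digest::SHA256.digest(value.to_s)
    # Constant padding up to the modulus size, as NO_PADDING requires.
    padding = ('a' * (@block_size - digest.bytesize)).b
    @public_key.public_encrypt(padding + digest, OpenSSL::PKey::RSA::NO_PADDING)
  end
end

# Usage: in a real setup the PEM would come from wherever the key file is stored.
key = OpenSSL::PKey::RSA.new(2048)
pseudonymizer = Pseudonymizer.new(key.public_key.to_pem)
token = pseudonymizer.pseudonymize(123)
puts token.bytesize
```

With this shape the payload length no longer depends on the input, so the "payload too long" branch from the pseudocode above cannot be reached for any reasonable key size.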

Why NO_PADDING was selected: if no padding is passed to #public_encrypt, OpenSSL::PKey::RSA::PKCS1_PADDING is used by default, and OpenSSL::PKey::RSA::PKCS1_OAEP_PADDING is also available. However, both of those paddings are based on random values, so they do not produce deterministic output for a given input, which does not meet the analytics requirements. We need to come up with our own padding method that delivers deterministic results.
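The difference can be checked with a short script. This is only an illustration of the point above: the fixed 'a' padding mirrors the pseudocode in this issue, and the variable names are arbitrary.

```ruby
require 'openssl'

pkey = OpenSSL::PKey::RSA.new(2048)
payload = '123'

# Randomized paddings: encrypting the same payload twice gives different output.
pkcs1_a = pkey.public_encrypt(payload, OpenSSL::PKey::RSA::PKCS1_PADDING)
pkcs1_b = pkey.public_encrypt(payload, OpenSSL::PKey::RSA::PKCS1_PADDING)
oaep_a = pkey.public_encrypt(payload, OpenSSL::PKey::RSA::PKCS1_OAEP_PADDING)
oaep_b = pkey.public_encrypt(payload, OpenSSL::PKey::RSA::PKCS1_OAEP_PADDING)

# NO_PADDING with our own constant padding: same input, same output.
block = ('a' * (pkey.n.num_bytes - payload.bytesize)) + payload
raw_a = pkey.public_encrypt(block, OpenSSL::PKey::RSA::NO_PADDING)
raw_b = pkey.public_encrypt(block, OpenSSL::PKey::RSA::NO_PADDING)

puts "PKCS1 deterministic?      #{pkcs1_a == pkcs1_b}"
puts "OAEP deterministic?       #{oaep_a == oaep_b}"
puts "NO_PADDING deterministic? #{raw_a == raw_b}"
```

The first two comparisons are expected to print false, since a collision between two random paddings is practically impossible, and the last prints true.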

Edited Aug 16, 2021 by Amanda Rueda