Pseudonymize GitLab Standard Context for Snowplow
Goal
To understand the behavior patterns of individual users without connecting them to any natural person using GitLab, we need to pseudonymize user_id before it is recorded anywhere.
What should be pseudonymized
Reference issue: #336779 (closed)
field | need pseudonymization | reasoning | who made decision |
---|---|---|---|
user_id | Yes | This is an indirect identifier that can be used to reveal directly identifiable data. With user_id, anyone can access the name and username of both public and private profiles. | @amandarueda |
Technical implementation
Let's use an actor-based feature flag to control the feature rollout.
The following pseudocode demonstrates a simplified encryption process:
require 'openssl'

key_size_bits = 2048
pkey = OpenSSL::PKey::RSA.new(key_size_bits)
payload = 123.to_s
# With NO_PADDING the input must be exactly as long as the key (in bytes)
padding_size = key_size_bits / 8 - payload.bytesize
if padding_size >= 0
  # A fixed filler keeps the output deterministic for a given payload
  padding = 'a' * padding_size
  pkey.public_encrypt("#{padding}#{payload}", OpenSSL::PKey::RSA::NO_PADDING)
else
  raise StandardError, "payload was too long for encryption with the given key"
end
Things that should be decided:
- Size of the key used, since it dictates how long a payload can be encrypted. Alternatively, we can consider hashing the data first and then encrypting it. That way we should avoid problems with varying payload length.
- Where and how the key file should be stored and then made available to the encryption service.
- Whether using disk IO to get the key would be sufficient performance-wise, or whether we should use a cache for the key or get back to the collector layer concept.
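The hash-then-encrypt alternative from the list above could look roughly like the sketch below. It assumes SHA-256 as the hash and reuses the fixed 'a' filler from the pseudocode; the pseudonymize helper and its parameters are hypothetical names used only for illustration, not a decided design.

```ruby
require 'openssl'

KEY_SIZE_BITS = 2048

# Hypothetical helper: hashes the value first, so the data passed to RSA
# always has the same length regardless of the original payload size.
def pseudonymize(value, pkey, key_size_bits: KEY_SIZE_BITS)
  digest = OpenSSL::Digest.new('SHA256').digest(value.to_s) # always 32 bytes
  # Fixed filler (an assumption carried over from the pseudocode above)
  padding = 'a' * (key_size_bits / 8 - digest.bytesize)
  pkey.public_encrypt("#{padding}#{digest}", OpenSSL::PKey::RSA::NO_PADDING)
end

pkey = OpenSSL::PKey::RSA.new(KEY_SIZE_BITS)
short_id = pseudonymize(123, pkey)
long_id = pseudonymize('9' * 1000, pkey) # longer than the key, still works
```

Because the digest has a fixed length, the "payload was too long" branch is no longer needed. The trade-off is that decrypting yields only the hash, not the original user_id.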
Why NO_PADDING was selected: by default, #public_encrypt uses OpenSSL::PKey::RSA::PKCS1_PADDING if no padding is provided. OpenSSL::PKey::RSA::PKCS1_OAEP_PADDING is also available. However, both of these paddings are based on random values, so they do not produce deterministic output for a given input, which would not satisfy the analytics requirements. We need to come up with our own padding method that delivers deterministic results.
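The difference between random and fixed padding can be checked with a short script. This is a minimal sketch that assumes a freshly generated 2048-bit key and the same fixed 'a' filler as in the pseudocode; the variable names are illustrative only.

```ruby
require 'openssl'

key_size_bits = 2048
pkey = OpenSSL::PKey::RSA.new(key_size_bits)
payload = 123.to_s

# PKCS1 padding mixes in random bytes, so encrypting the same input twice
# yields two different ciphertexts.
random_a = pkey.public_encrypt(payload, OpenSSL::PKey::RSA::PKCS1_PADDING)
random_b = pkey.public_encrypt(payload, OpenSSL::PKey::RSA::PKCS1_PADDING)

# A fixed filler with NO_PADDING always produces the same ciphertext.
padded = payload.rjust(key_size_bits / 8, 'a')
fixed_a = pkey.public_encrypt(padded, OpenSSL::PKey::RSA::NO_PADDING)
fixed_b = pkey.public_encrypt(padded, OpenSSL::PKey::RSA::NO_PADDING)

puts "PKCS1_PADDING deterministic: #{random_a == random_b}"
puts "NO_PADDING deterministic:    #{fixed_a == fixed_b}"
```

Only the second comparison is expected to hold, which is the property the analytics side needs in order to group events by the same pseudonymized user_id.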