Redact PII during data preprocessing
To redact PII, we follow the same approach as the HF SantaCoder project.
Entities we're able to identify with the target change:
- emails, using a regular expression
- IPv4/IPv6 addresses, using regular expressions and additional filters to remove false positives:
  - we mask only public IP addresses
  - we require that the IP address is not a popular DNS address like 8.8.8.8
- secrets, using detect-secrets:
  - we require the secret to look like gibberish
  - we do not mask hashes if keywords such as hash or md5 appear in the context of the secret
More details in https://arxiv.org/abs/2301.03988 (Section 4)
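The detection filters listed above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code from this change: the regular expressions are simplified, and POPULAR_DNS, HASH_KEYWORDS, should_mask_ip, looks_like_hash and the 50-character context window are hypothetical names and values chosen for the example. The detect-secrets step and the gibberish check are left out.

```python
import ipaddress
import re

# Assumed, simplified patterns; the patterns used in the real pipeline may be stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
IPV4_RE = re.compile(r"(?<![\w.])(?:\d{1,3}\.){3}\d{1,3}(?![\w.])")

# Hypothetical allow-list of well-known DNS resolvers that are never masked.
POPULAR_DNS = {"8.8.8.8", "8.8.4.4", "1.1.1.1", "1.0.0.1"}

# Keywords that suggest a detected "secret" is actually a hash.
HASH_KEYWORDS = ("hash", "md5")


def should_mask_ip(candidate: str) -> bool:
    """Return True only for valid, public, non-DNS IP addresses."""
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        # Not a real address (e.g. 999.1.1.1), so it is a false positive.
        return False
    return ip.is_global and candidate not in POPULAR_DNS


def looks_like_hash(text: str, start: int, end: int, window: int = 50) -> bool:
    """Check the text around text[start:end] for hash-related keywords."""
    before = text[max(0, start - window):start]
    after = text[end:end + window]
    context = (before + after).lower()
    return any(keyword in context for keyword in HASH_KEYWORDS)
```

With these helpers, a candidate found by IPV4_RE is masked only if should_mask_ip returns True, and a candidate secret is skipped when looks_like_hash returns True for its span.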
Masks we apply to redact PII:
- emails => random example email with the format xxxx@example.com
- public IPs => private IP addresses (v4 or v6, depending on the target) randomly selected from the predefined list
- secrets => random string of the same length
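As an illustration of the masks above, here is a minimal sketch in Python. It makes several assumptions: the replacement pools PRIVATE_IPV4 and PRIVATE_IPV6 are made-up stand-ins for the predefined list, the email local part is assumed to be four random lowercase letters, and the function names mask_email, mask_ip and mask_secret are hypothetical.

```python
import random
import string

# Hypothetical replacement pools; the actual predefined list is not shown here.
PRIVATE_IPV4 = ["10.0.0.1", "172.16.31.10", "192.168.1.1"]
PRIVATE_IPV6 = ["fd00::1", "fd12:3456:789a::1"]


def mask_email(rng: random.Random) -> str:
    """Return a random address in the xxxx@example.com format."""
    local = "".join(rng.choices(string.ascii_lowercase, k=4))
    return f"{local}@example.com"


def mask_ip(original: str, rng: random.Random) -> str:
    """Replace a public IP with a private one of the same version."""
    pool = PRIVATE_IPV6 if ":" in original else PRIVATE_IPV4
    return rng.choice(pool)


def mask_secret(secret: str, rng: random.Random) -> str:
    """Replace a secret with a random alphanumeric string of equal length."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choices(alphabet, k=len(secret)))
```

Passing an explicit random.Random instance keeps the masking reproducible when a fixed seed is needed, for example in tests.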
Edited by Alexander Chueshev