Optimize DF pipeline to preprocess the raw dataset
This MR optimizes the DF pipeline used to preprocess the raw dataset:
- filter out autogenerated files right after the exact deduplication
- set `n_comments_copyright` to 30 comments
- optimize the exact deduplication transform to avoid OOM
- take only the last suffix to infer the file language, e.g., `file.x.y` >> `.y` instead of `.x.y`
- change the input schema to match the `repo_contents_v2` table
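
The last-suffix rule from the list above can be illustrated with a minimal sketch. It is not the pipeline's actual code: the `infer_language` helper and the `SUFFIX_TO_LANGUAGE` mapping are hypothetical names introduced here for illustration only.

```python
from pathlib import PurePosixPath

# Hypothetical suffix-to-language mapping, for illustration only;
# the real pipeline may use a different and much larger table.
SUFFIX_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".go": "go",
}


def infer_language(file_path):
    """Return the language for the last suffix of file_path, or None."""
    # PurePosixPath.suffix yields only the final suffix,
    # so "file.x.y" gives ".y" rather than ".x.y".
    suffix = PurePosixPath(file_path).suffix.lower()
    return SUFFIX_TO_LANGUAGE.get(suffix)
```

With this rule, a path such as `src/app.test.js` is looked up by `.js` rather than by `.test.js`, and a file without any suffix maps to no language.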
Please check the preprocessed data stored in `unreview-poc-390200e5.gl_code_suggestions.dataset_v2`.
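
This description does not say how the exact deduplication transform was optimized. As a rough sketch of one common way to keep memory bounded, and not necessarily what this MR does, exact duplicates can be detected by comparing fixed-size content digests instead of the full file contents. The `content` field name and both helper functions are assumptions made for this example.

```python
import hashlib


def content_digest(content):
    """Return a fixed-size SHA-256 hex digest of the file content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def exact_deduplicate(records):
    """Yield the first record seen for each distinct content.

    Only the digests are kept in memory, not the file contents,
    so memory grows with the number of unique files rather than
    with their total size.
    """
    seen = set()
    for record in records:
        # "content" is a hypothetical field name for this sketch.
        digest = content_digest(record["content"])
        if digest not in seen:
            seen.add(digest)
            yield record
```

In a distributed pipeline the same idea would typically be expressed by keying each record by its digest and keeping one record per key, which avoids shuffling or grouping on the full file contents.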
Edited by Alexander Chueshev