Optimize DF pipeline to preprocess the raw dataset (!18) · Merge requests · GitLab.org / ModelOps / AI Assisted (formerly Applied ML) / Code Suggestions / Model Development

This MR optimizes the DF pipeline used to preprocess the raw dataset:

- filter out autogenerated files immediately after the exact deduplication
- set `n_comments_copyright` to 30 comments
- optimize the exact deduplication transform to avoid OOM errors
- use only the last suffix to infer the file language, e.g., `file.x.y` maps to `.y` instead of `.x.y`
- change the input schema to match the `repo_contents_v2` table
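
For reviewers, here is a minimal Python sketch of two of the changes above: last-suffix extraction and exact deduplication. It is not the pipeline code from this MR. The names `last_suffix` and `dedup_exact` are hypothetical, and comparing SHA-256 digests instead of full file contents is an assumed way to keep memory bounded; the MR does not state how the transform avoids OOM.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Iterable, Iterator, Tuple


def last_suffix(path: str) -> str:
    """Return only the final extension of a path, e.g. 'file.x.y' -> '.y'.

    Hypothetical helper; returns an empty string when there is no extension.
    """
    return PurePosixPath(path).suffix


def dedup_exact(files: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (path, content) pairs, skipping byte-identical contents.

    Assumption: only a fixed-size digest per unique file is kept in memory,
    so the full contents never need to be held or compared all at once.
    """
    seen = set()
    for path, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate of a file already emitted
        seen.add(digest)
        yield path, content
```

In a distributed runner the same idea would usually be expressed as keying each record by its digest and keeping one record per key, rather than holding a single in-process set.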

Please check the preprocessed data stored in `unreview-poc-390200e5.gl_code_suggestions.dataset_v2`.

Edited Mar 28, 2023 by Alexander Chueshev
