Optimize DF pipeline to preprocess the raw dataset

Alexander Chueshev requested to merge optimize-df-preprocessing into main

This MR optimizes the DF pipeline used to preprocess the raw dataset:

  • filter out autogenerated files immediately after the exact deduplication
  • set n_comments_copyright to 30
  • optimize the exact deduplication transform to avoid OOM
  • use only the last suffix to infer the file language, e.g., file.x.y is treated as .y rather than .x.y
  • change the input schema to match the repo_contents_v2 table
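
The last-suffix rule from the list above can be illustrated with a minimal Python sketch. This is not the code from the MR: the function names and the EXTENSION_TO_LANGUAGE mapping are hypothetical and exist only to show the behavior, and lowercasing the suffix is an extra assumption.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping for illustration only; the MR does not list the real one.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
}


def last_suffix(path: str) -> str:
    """Return only the final suffix of the file name (lowercased by assumption)."""
    # PurePosixPath.suffix yields the last dot-separated component,
    # whereas .suffixes would yield every component ([".x", ".y"]).
    return PurePosixPath(path).suffix.lower()


def infer_language(path: str) -> Optional[str]:
    """Look up the language by the last suffix; None if the suffix is unknown."""
    return EXTENSION_TO_LANGUAGE.get(last_suffix(path))
```

With this sketch, a name such as src/app.min.js is looked up under .js rather than .min.js, and a file without any suffix (for example Makefile) yields no language.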

Please check the preprocessed data stored in unreview-poc-390200e5.gl_code_suggestions.dataset_v2.

Edited by Alexander Chueshev
