
Add a DF pipeline for basic dataset preprocessing

Alexander Chueshev requested to merge preprocess-dataset into main

This MR adds the DF pipeline that (see the sketch after this list):

  • reads data from BQ
  • applies the following preprocessing, similar to PolyCoder and Codex:
    • filtering by maximum and minimum line length
    • filtering out auto-generated files
    • exact deduplication after str.strip()
    • filtering by the maximum fraction of non-alphanumeric characters
  • infers the programming language (PL) from file paths
  • splits the full dataset into training, test, and validation sets
  • writes the preprocessed dataset back to BQ

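The sketch below is only illustrative, not the exact code in data/df/preprocessing.py: the column names (content, path), the filter thresholds, the extension-to-PL mapping, the auto-generation heuristic, and the 80/10/10 split ratios are all assumptions for the example.

import argparse
import random

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed thresholds; the real pipeline may use different values.
MAX_LINE_LENGTH = 1000
MIN_LINE_COUNT = 5
MAX_NON_ALNUM_FRACTION = 0.3

# Assumed extension -> PL mapping (illustrative only).
EXT_TO_PL = {"py": "python", "rb": "ruby", "js": "javascript", "go": "go"}


def keep_file(row):
    """PolyCoder/Codex-style filters applied to one BQ row."""
    text = row["content"]
    lines = text.splitlines()
    if len(lines) < MIN_LINE_COUNT:
        return False
    if max(len(line) for line in lines) > MAX_LINE_LENGTH:
        return False
    # Naive auto-generation heuristic (assumption).
    if "auto-generated" in text.lower() or "do not edit" in text.lower():
        return False
    non_alnum = sum(1 for c in text if not c.isalnum())
    return non_alnum / max(len(text), 1) <= MAX_NON_ALNUM_FRACTION


def infer_pl(row):
    """Infer the programming language from the file path extension."""
    ext = row["path"].rsplit(".", 1)[-1].lower() if "." in row["path"] else ""
    return {**row, "pl": EXT_TO_PL.get(ext, "unknown")}


def assign_split(row, train=0.8, test=0.1):
    """Randomly route a row to train/test/validation (assumed 80/10/10)."""
    r = random.random()
    split = "train" if r < train else "test" if r < train + test else "validation"
    return {**row, "split": split}


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_bq_table", required=True)
    parser.add_argument("--output_bq_table", required=True)
    args, beam_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(beam_args)) as pipeline:
        (
            pipeline
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(table=args.input_bq_table)
            | "Filter" >> beam.Filter(keep_file)
            # Exact deduplication: key rows by stripped content, keep one row per key.
            | "KeyByContent" >> beam.Map(lambda row: (row["content"].strip(), row))
            | "Dedup" >> beam.CombinePerKey(lambda rows: next(iter(rows)))
            | "DropKey" >> beam.Values()
            | "InferPL" >> beam.Map(infer_pl)
            | "AssignSplit" >> beam.Map(assign_split)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                args.output_bq_table,
                # Assumed output schema for the example.
                schema="path:STRING,content:STRING,pl:STRING,split:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()

The main design choice in the sketch is doing exact deduplication as a key/value step on the stripped content, so files that differ only in surrounding whitespace collapse to one representative row.
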
How to run the pipeline:

export GOOGLE_APPLICATION_CREDENTIALS=<path to json file>
export GCP_PROJECT=unreview-poc-390200e5
export GCP_REGION=us-central1
export GCP_BUCKET_TEMP=unreview-dataflow

./venv/bin/python ./data/df/preprocessing.py \
  --runner=DataflowRunner \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --input_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_repo_contents_v1" \
  --output_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1" \
  --temp_location="gs://${GCP_BUCKET_TEMP}/tmp/" \
  --save_main_session \
  --machine_type=n1-highmem-8

Where to find a data example?

I've run the pipeline on our sampled dataset sample_repo_contents_v1 (15% of the initial one) to check that it works correctly. Please find the output preprocessed dataset in sample_preprocessed_dataset_v1.
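
As a quick sanity check of the output table (assuming the bq CLI is authenticated against the same project), a simple row count query can be run; the table name matches the --output_bq_table value used above.

bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS row_count FROM `unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1`'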

Ref ai-assist#22 (closed)

cc @mray2020

