Add a DF pipeline for basic dataset preprocessing
This MR adds a Dataflow (DF) pipeline that:
- reads data from BQ
- applies the following preprocessing, similar to PolyCoder and Codex (a rough sketch of these filters follows this list):
  - filter by line max and line min
  - filter out autogenerated files
  - exact deduplication after `str.strip()`
  - filter by the maximum fraction of non-alphanumeric characters
- infers the PL from the file path names
- splits the full dataset into training, test, and validation sets
- writes the preprocessed dataset back to BQ
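
For reference, here is a rough Python sketch of what those filters could look like. The thresholds, the interpretation of the line max/min bounds, the autogenerated-file heuristic, and the extension-to-PL mapping are all illustrative assumptions, not the exact logic in `data/df/preprocessing.py`:

```python
import os
from typing import Optional

# Illustrative thresholds only; the actual pipeline values may differ.
MAX_LINE_LENGTH = 1000        # assumed upper bound on the longest line
MIN_LINE_COUNT = 5            # assumed lower bound on the number of lines
MAX_NON_ALNUM_FRACTION = 0.5  # assumed cap on non-alphanumeric characters

AUTOGEN_MARKERS = ("auto-generated", "autogenerated", "generated by")   # assumed heuristic
EXTENSION_TO_PL = {".py": "Python", ".js": "JavaScript", ".rb": "Ruby"}  # assumed mapping


def passes_line_filters(content: str) -> bool:
    """Filter by line max / line min (interpreted here as max line length and min line count)."""
    lines = content.splitlines()
    if len(lines) < MIN_LINE_COUNT:
        return False
    return max(len(line) for line in lines) <= MAX_LINE_LENGTH


def is_autogenerated(content: str) -> bool:
    """Heuristically flag autogenerated files by markers in the file header."""
    header = content[:500].lower()
    return any(marker in header for marker in AUTOGEN_MARKERS)


def passes_alnum_filter(content: str) -> bool:
    """Filter by the maximum fraction of non-alphanumeric characters."""
    if not content:
        return False
    non_alnum = sum(1 for ch in content if not ch.isalnum())
    return non_alnum / len(content) <= MAX_NON_ALNUM_FRACTION


def dedup_key(content: str) -> str:
    """Key used for exact deduplication after str.strip()."""
    return content.strip()


def infer_pl(path: str) -> Optional[str]:
    """Infer the programming language (PL) from the file path's extension."""
    _, ext = os.path.splitext(path)
    return EXTENSION_TO_PL.get(ext.lower())
```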
How to run the pipeline:
```shell
export GOOGLE_APPLICATION_CREDENTIALS=<path to json file>
export GCP_PROJECT=unreview-poc-390200e5
export GCP_REGION=us-central1
export GCP_BUCKET_TEMP=unreview-dataflow

./venv/bin/python ./data/df/preprocessing.py \
  --runner=DataflowRunner \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --input_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_repo_contents_v1" \
  --output_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1" \
  --temp_location="gs://${GCP_BUCKET_TEMP}/tmp/" \
  --save_main_session \
  --machine_type=n1-highmem-8
```
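
For context, this is roughly the shape such a Beam/Dataflow pipeline usually takes. The custom flags mirror the command above, but the placeholder transform and the BigQuery dispositions are assumptions rather than the actual contents of `data/df/preprocessing.py`:

```python
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    # Custom flags mirroring the command above; the remaining flags (runner,
    # project, region, temp_location, ...) are passed through to PipelineOptions.
    parser.add_argument("--input_bq_table", required=True)
    parser.add_argument("--output_bq_table", required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (
            p
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(table=known_args.input_bq_table)
            | "Preprocess" >> beam.Filter(lambda row: True)  # placeholder for the filters above
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                known_args.output_bq_table,
                # Assumes the output table already exists with the right schema.
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )


if __name__ == "__main__":
    run()
```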
Where to find a data example?
I've run the pipeline on our sampled dataset `sample_repo_contents_v1` (15% of the initial one) to check that it works correctly. Please find the output preprocessed dataset in `sample_preprocessed_dataset_v1`.
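
To eyeball a few preprocessed rows without opening the BQ console, something like this should work (assuming the google-cloud-bigquery client library and the credentials exported above):

```python
from google.cloud import bigquery

client = bigquery.Client(project="unreview-poc-390200e5")
table = "unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1"

# Print the first few rows of the preprocessed dataset.
for row in client.list_rows(table, max_results=5):
    print(dict(row))
```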
cc @mray2020