Add a DF pipeline for basic dataset preprocessing
This MR adds a Dataflow (DF) pipeline that:
- reads data from BQ
- applies the following preprocessing, similar to PolyCoder and Codex (a rough sketch of these filters follows this list):
  - filter by line max and line min
  - filter out autogenerated files
  - exact deduplication after `str.strip()`
  - filter by the maximum fraction of non-alphanumeric characters
- infers the PL from the file path names
- splits the full dataset into training, test, and validation sets
- writes the preprocessed dataset back to BQ
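
For reference, here is a rough Python sketch of what those filters could look like. The thresholds, the interpretation of the line max/min bounds, the autogenerated-file heuristic, and the extension-to-PL mapping are all illustrative assumptions, not the exact logic in `data/df/preprocessing.py`:

```python
import os
from typing import Optional

# Illustrative thresholds only; the actual pipeline values may differ.
MAX_LINE_LENGTH = 1000        # assumed upper bound on the longest line
MIN_LINE_COUNT = 5            # assumed lower bound on the number of lines
MAX_NON_ALNUM_FRACTION = 0.5  # assumed cap on non-alphanumeric characters

AUTOGEN_MARKERS = ("auto-generated", "autogenerated", "generated by")   # assumed heuristic
EXTENSION_TO_PL = {".py": "Python", ".js": "JavaScript", ".rb": "Ruby"}  # assumed mapping


def passes_line_filters(content: str) -> bool:
    """Filter by line max / line min (interpreted here as max line length and min line count)."""
    lines = content.splitlines()
    if len(lines) < MIN_LINE_COUNT:
        return False
    return max(len(line) for line in lines) <= MAX_LINE_LENGTH


def is_autogenerated(content: str) -> bool:
    """Heuristically flag autogenerated files by markers in the file header."""
    header = content[:500].lower()
    return any(marker in header for marker in AUTOGEN_MARKERS)


def passes_alnum_filter(content: str) -> bool:
    """Filter by the maximum fraction of non-alphanumeric characters."""
    if not content:
        return False
    non_alnum = sum(1 for ch in content if not ch.isalnum())
    return non_alnum / len(content) <= MAX_NON_ALNUM_FRACTION


def dedup_key(content: str) -> str:
    """Key used for exact deduplication after str.strip()."""
    return content.strip()


def infer_pl(path: str) -> Optional[str]:
    """Infer the programming language (PL) from the file path's extension."""
    _, ext = os.path.splitext(path)
    return EXTENSION_TO_PL.get(ext.lower())
```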
How to run the pipeline:
```shell
export GOOGLE_APPLICATION_CREDENTIALS=<path to json file>
export GCP_PROJECT=unreview-poc-390200e5
export GCP_REGION=us-central1
export GCP_BUCKET_TEMP=unreview-dataflow

./venv/bin/python ./data/df/preprocessing.py \
  --runner=DataflowRunner \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --input_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_repo_contents_v1" \
  --output_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1" \
  --temp_location="gs://${GCP_BUCKET_TEMP}/tmp/" \
  --save_main_session \
  --machine_type=n1-highmem-8
```
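
For context, this is roughly the shape such a Beam/Dataflow pipeline usually takes. The custom flags mirror the command above, but the placeholder transform and the BigQuery dispositions are assumptions rather than the actual contents of `data/df/preprocessing.py`:

```python
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    # Custom flags mirroring the command above; the remaining flags (runner,
    # project, region, temp_location, ...) are passed through to PipelineOptions.
    parser.add_argument("--input_bq_table", required=True)
    parser.add_argument("--output_bq_table", required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (
            p
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(table=known_args.input_bq_table)
            | "Preprocess" >> beam.Filter(lambda row: True)  # placeholder for the filters above
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                known_args.output_bq_table,
                # Assumes the output table already exists with the right schema.
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )


if __name__ == "__main__":
    run()
```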
Where to find a data example?
I've run the pipeline on our sampled dataset `sample_repo_contents_v1` (15% of the initial one) to check that it works correctly. Please find the output preprocessed dataset in `sample_preprocessed_dataset_v1`.
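
To eyeball a few preprocessed rows without opening the BQ console, something like this should work (assuming the google-cloud-bigquery client library and the credentials exported above):

```python
from google.cloud import bigquery

client = bigquery.Client(project="unreview-poc-390200e5")
table = "unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1"

# Print the first few rows of the preprocessed dataset.
for row in client.list_rows(table, max_results=5):
    print(dict(row))
```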
cc @mray2020