Implement DF pipeline to export dataset from BQ
This MR implements the DF pipeline to export the training, test, or validation dataset from BigQuery for the specified languages.
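For reference, here is a minimal sketch of how such a Beam export pipeline can be structured. The column names (`split`, `language`, `content`), the JSONL output format, and the per-language output layout are assumptions for illustration and are not taken from `export-bq.py`:

```python
# Hypothetical sketch only; not the actual export-bq.py implementation.
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_bq_table", required=True)
    parser.add_argument("--language", action="append", default=[])
    parser.add_argument("--split", required=True)
    parser.add_argument("--output_path", required=True)
    args, beam_args = parser.parse_known_args(argv)

    # Remaining flags (--runner, --project, --region, --temp_location, ...)
    # are forwarded to Beam as pipeline options.
    options = PipelineOptions(beam_args)

    with beam.Pipeline(options=options) as p:
        rows = (
            p
            # ReadFromBigQuery expects "project:dataset.table".
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
                table=args.input_bq_table.replace(".", ":", 1))
            # Keep only the requested split (assumed "split" column).
            | "FilterSplit" >> beam.Filter(lambda row: row["split"] == args.split)
        )

        # One output branch per requested language (assumed "language" column).
        for lang in args.language:
            (
                rows
                | f"Filter_{lang}" >> beam.Filter(
                    lambda row, lang=lang: row["language"] == lang)
                | f"ToJson_{lang}" >> beam.Map(lambda row: json.dumps(row, default=str))
                | f"Write_{lang}" >> beam.io.WriteToText(
                    f"{args.output_path}{args.split}/{lang}/part",
                    file_name_suffix=".jsonl")
            )


if __name__ == "__main__":
    run()
```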
How to run:
Using the local direct runner:
```shell
export GOOGLE_APPLICATION_CREDENTIALS=<path to json key>
export GCP_PROJECT=unreview-poc-390200e5
export GCP_REGION=us-central1
export GCP_BUCKET_TEMP=unreview-dataflow

./venv/bin/python ./data/df/export-bq.py \
  --runner=DirectRunner \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --input_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1" \
  --language=c \
  --language=python \
  --split="test" \
  --output_path="data/export/" \
  --temp_location="gs://${GCP_BUCKET_TEMP}/tmp/" \
  --save_main_session
```
Using the Dataflow runner:
```shell
export GOOGLE_APPLICATION_CREDENTIALS=<path to json file>
export GCP_PROJECT=unreview-poc-390200e5
export GCP_REGION=us-central1
export GCP_BUCKET_TEMP=unreview-dataflow
export GCP_BUCKET_EXPORT=code-suggestions

./venv/bin/python ./data/df/export-bq.py \
  --runner=DataflowRunner \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --input_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1" \
  --language=c \
  --language=python \
  --language=ruby \
  --language=rust \
  --split="test" \
  --output_path="gs://${GCP_BUCKET_EXPORT}/data/export/sample/20230314/" \
  --temp_location="gs://${GCP_BUCKET_TEMP}/tmp/" \
  --save_main_session
```
An example of the DF pipeline was used to export the following sample to GCS.
How to use HF datasets with the data locally
Please find below a snippet showing how to load the exported dataset using HF datasets. In this example, we assume the following directory structure:
```
data/export
├── train
│   ├── c
│   ├── ruby
│   └── rust
└── test
    ├── c
    ├── ruby
    └── rust
```
```python
from datasets import load_dataset

if __name__ == "__main__":
    dataset = load_dataset("data/export/")

    print(dataset["train"][0]["content"])
    print(dataset["test"][0]["content"])
```