Add BigQuery SQL query to build the raw dataset
This MR implements the SQL query used to build the raw dataset from the GitHub open dataset on BigQuery: https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code
Three steps:
- estimating attention for each repo (githubarchive datasource), i.e., the maximum of its star count and watch count
- joining files with the repos having at least 50 stars (similar to PolyCoder)
- joining files with the content and applying deduplication
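For reviewers, the three steps above can be sketched on toy data. This is only a minimal illustration, not the query from this MR: it uses Python's built-in sqlite3 instead of BigQuery, and the table names (`repo_stats`, `files`, `contents`), their columns, and the helper names are assumptions made for the example. In BigQuery, the scalar `MAX(a, b)` used below would be written `GREATEST(a, b)`.

```python
import sqlite3

MIN_ATTENTION = 50  # threshold from the description: at least 50 stars


def make_demo_db():
    """Build an in-memory database with hypothetical toy tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE repo_stats (repo_name TEXT, stars INTEGER, watch_count INTEGER);
        CREATE TABLE files (repo_name TEXT, path TEXT, id TEXT);
        CREATE TABLE contents (id TEXT, content TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO repo_stats VALUES (?, ?, ?)",
        [("org/popular", 120, 80), ("org/watched", 10, 60), ("org/small", 5, 3)],
    )
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [
            ("org/popular", "main.py", "a1"),
            ("org/watched", "copy.py", "a1"),  # same blob as main.py
            ("org/watched", "util.go", "b2"),
            ("org/small", "tiny.rb", "c3"),
        ],
    )
    conn.executemany(
        "INSERT INTO contents VALUES (?, ?)",
        [("a1", "print('hi')"), ("b2", "package util"), ("c3", "puts 1")],
    )
    return conn


def build_raw_dataset(conn, min_attention=MIN_ATTENTION):
    """Run the three steps: attention, repo filter + join, content join + dedup."""
    query = """
    WITH attention AS (
        -- Step 1: attention = max of stars and watch count (scalar MAX in SQLite)
        SELECT repo_name,
               MAX(COALESCE(stars, 0), COALESCE(watch_count, 0)) AS attention
        FROM repo_stats
    ),
    popular_files AS (
        -- Step 2: keep only files from repos above the threshold
        SELECT f.repo_name, f.path, f.id, a.attention
        FROM files AS f
        JOIN attention AS a ON a.repo_name = f.repo_name
        WHERE a.attention >= ?
    ),
    ranked AS (
        -- Step 3: attach content and rank exact duplicates
        SELECT p.repo_name, p.path, c.content,
               ROW_NUMBER() OVER (
                   PARTITION BY c.content
                   ORDER BY p.attention DESC, p.repo_name, p.path
               ) AS rn
        FROM popular_files AS p
        JOIN contents AS c ON c.id = p.id
    )
    -- Exact deduplication: keep one row per distinct content
    SELECT repo_name, path, content
    FROM ranked
    WHERE rn = 1
    ORDER BY repo_name, path
    """
    return conn.execute(query, (min_attention,)).fetchall()


if __name__ == "__main__":
    for row in build_raw_dataset(make_demo_db()):
        print(row)
```

On the toy data, `org/small` falls below the threshold, and the duplicate `copy.py` is dropped in favor of the copy from the higher-attention repo.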
Supported languages:
- C
- C++
- C#
- Go
- JavaScript
- Java
- PHP
- Python
- Ruby
- Rust
- Scala
- TypeScript
- Kotlin (thanks @nkhalwadekar for the reminder to add this PL)
Note:
- exact deduplication is already implemented within the SQL query
- I run this script using BigQuery Analytics Hub
- for testing queries, please use the sampled tables (15% of the original dataset) to save cost and time
Built dataset - link
Edited by Alexander Chueshev