
Add BigQuery SQL query to build the raw dataset

Alexander Chueshev requested to merge add-bigquery-sql into main

This MR implements the SQL query used to build the raw dataset from the BigQuery GitHub open dataset (https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code).

Three steps:

  • estimating the attention each repo receives (githubarchive data source), i.e., the maximum of its star count and watch count
  • joining files with the repos that have at least 50 stars (similar to PolyCoder)
  • joining files with their content and applying exact deduplication
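
The three steps above can be sketched on toy data. This is only an illustration, not the query from this MR: it runs on an in-memory SQLite database rather than BigQuery, the table and column names (`repos`, `files`, `contents`, `content_id`) are hypothetical, the star and watch counts are assumed to be already aggregated per repo, and keeping the first copy by (repo_name, path) is an assumed tie-break for deduplication.

```python
import sqlite3

MIN_ATTENTION = 50  # threshold taken from the MR description

# Hypothetical schema; the real BigQuery tables are laid out differently.
QUERY = """
WITH attention AS (
    -- Step 1. Attention is the larger of the star count and the watch count.
    SELECT repo_name, MAX(stars, watchers) AS attention
    FROM repos
),
popular_files AS (
    -- Step 2. Keep only files from repos that reach the threshold.
    SELECT f.repo_name, f.path, f.content_id
    FROM files AS f
    JOIN attention AS a ON a.repo_name = f.repo_name
    WHERE a.attention >= :min_attention
),
ranked AS (
    -- Step 3. Attach file content and number identical contents.
    SELECT p.repo_name, p.path, c.content,
           ROW_NUMBER() OVER (
               PARTITION BY c.content
               ORDER BY p.repo_name, p.path
           ) AS rn
    FROM popular_files AS p
    JOIN contents AS c ON c.id = p.content_id
)
-- Exact deduplication: keep one row per distinct content.
SELECT repo_name, path, content
FROM ranked
WHERE rn = 1
ORDER BY repo_name, path
"""


def make_toy_db():
    """Build a tiny in-memory database with made-up repos and files."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE repos (repo_name TEXT, stars INTEGER, watchers INTEGER);
        CREATE TABLE files (repo_name TEXT, path TEXT, content_id TEXT);
        CREATE TABLE contents (id TEXT, content TEXT);
    """)
    conn.executemany("INSERT INTO repos VALUES (?, ?, ?)", [
        ("a/popular", 120, 80),  # attention 120, kept
        ("b/watched", 10, 75),   # attention 75, kept via watch count
        ("c/obscure", 3, 5),     # attention 5, dropped
    ])
    conn.executemany("INSERT INTO files VALUES (?, ?, ?)", [
        ("a/popular", "main.py", "c1"),
        ("b/watched", "copy.py", "c2"),  # same text as c1, an exact duplicate
        ("b/watched", "util.go", "c3"),
        ("c/obscure", "lib.rs", "c4"),
    ])
    conn.executemany("INSERT INTO contents VALUES (?, ?)", [
        ("c1", "print('hi')"),
        ("c2", "print('hi')"),
        ("c3", "package util"),
        ("c4", "fn main() {}"),
    ])
    return conn


def build_raw_dataset(conn):
    """Run the three-step sketch and return (repo_name, path, content) rows."""
    return conn.execute(QUERY, {"min_attention": MIN_ATTENTION}).fetchall()


if __name__ == "__main__":
    for row in build_raw_dataset(make_toy_db()):
        print(row)
```

In this toy run, `b/watched` passes the threshold only because of its watch count, and its `copy.py` is dropped as an exact duplicate of `main.py` from `a/popular`.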

Supported languages:

  • C
  • C++
  • C#
  • Go
  • JavaScript
  • Java
  • PHP
  • Python
  • Ruby
  • Rust
  • Scala
  • TypeScript
  • Kotlin (thanks @nkhalwadekar for the reminder to add this PL)

Note:

  1. exact deduplication is already implemented within the SQL query
  2. I run this script using BigQuery Analytics Hub.
  3. to test queries, please use the sampled tables (15% of the original dataset) to save cost and time

Built dataset - link

Ref: ai-assist#22 (comment 1292021104)

Edited by Alexander Chueshev
