
Add GCP connector for importing package metadata

Igor Frenkel requested to merge 383797-gcp-connector-for-package-metadata into master

What does this MR do and why?

This MR adds a connector object for retrieving package metadata CSV files from a GCP bucket.

Related: #383797 (closed)

Reviewer context

This issue is part of Sync Rails backend with License DB (&9349 - closed) which is in turn a sub-epic of Replace license-finder MVC (&8072 - closed).

In brief:

  • The External License Database is the service that does the actual work of finding package metadata.
  • The synchronization functionality of this issue's epic is responsible for mirroring that external data in the GitLab instance DB.
  • The mirrored data makes it possible to query the licenses of project dependencies.

MR context

The External License Database and GitLab instances synchronize through a public GCP bucket. This MR uses the google-cloud-storage library to create an anonymous GCP connection and fetch the data. The bucket has a prefix "directory" for each version format and package registry type (e.g. <bucket>/v1/rubygems).
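To make the layout concrete, here is a minimal sketch of how an object name in such a bucket could be built and taken apart. It assumes object names follow the pattern <version>/<registry>/<sequence_id>/<chunk_id>, which is inferred from the example prefix and the file identifier described in this MR. The names FilePosition, registry_prefix, and parse_object_name are hypothetical and are not part of the connector.

```ruby
# Hypothetical value object for one file in the bucket (not the MR's class).
FilePosition = Struct.new(:version, :registry, :sequence_id, :chunk_id)

# Builds the prefix "directory" for one version format and registry type.
def registry_prefix(version, registry)
  "#{version}/#{registry}/"
end

# Splits an object name into its four assumed parts.
# Returns nil when the name does not have all four parts.
def parse_object_name(name)
  parts = name.split('/', 4)
  return nil unless parts.size == 4 && parts.none?(&:empty?)

  FilePosition.new(*parts)
end

position = parse_object_name('v1/rubygems/1672196094/00000001.csv')
puts registry_prefix(position.version, position.registry)
puts "#{position.sequence_id}/#{position.chunk_id}"
```

Keeping the version and registry in the prefix means a listing request can be limited to one registry, so a caller never has to enumerate objects it does not need.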

The design of the bucket is meant to limit resource use by breaking up data into separate files:

  • Each file is a CSV file.
  • The name of the file is a unique identifier (sequence_id/chunk_id).
  • The caller can use the identifier to store the last synchronized file for a particular package registry.
  • This allows the caller to limit resource usage by not re-importing the entire bucket.

The connector yields CSV data from every file that comes after the identifier passed in by the caller. It also yields the sequence and chunk of the file being read, so that the caller can save this identifier as the starting position for the next sync.
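The "files after a given identifier" selection can be sketched as follows. This is not the connector's implementation: it assumes that files are ordered by sequence_id first and chunk_id second, and that both compare numerically. Checkpoint and files_after are hypothetical names used only for illustration.

```ruby
# Hypothetical position of a file, built from a "sequence_id/chunk_id" name.
Checkpoint = Struct.new(:sequence_id, :chunk_id) do
  # Numeric sort key; "00000001.csv".to_i parses the leading digits as 1.
  def sort_key
    [sequence_id.to_i, chunk_id.to_i]
  end
end

# Returns the positions that come strictly after the checkpoint, in order.
# A nil checkpoint means nothing has been synced yet, so everything is returned.
def files_after(names, checkpoint)
  positions = names.map { |name| Checkpoint.new(*name.split('/', 2)) }
  positions = positions.sort_by(&:sort_key)
  return positions if checkpoint.nil?

  positions.select { |pos| (pos.sort_key <=> checkpoint.sort_key) > 0 }
end

names = [
  '1672196094/00000001.csv',
  '1672196094/00000002.csv',
  '1672200000/00000001.csv'
]
last_synced = Checkpoint.new('1672196094', '00000001.csv')
files_after(names, last_synced).each do |pos|
  puts "#{pos.sequence_id}/#{pos.chunk_id}"
end
```

Comparing strictly greater than the checkpoint means the last synchronized file is not imported a second time.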

Interface

The client for this connector is the SyncService, which:

  • requests bucket data, optionally only the data after a given file in the bucket
  • expects to be yielded
    • file data
    • the position info of the file that the data came from
  • saves the yielded data
  • saves the position info so that the next sync can start from that point
How to test this MR locally

To try this code "live" against a test data set (https://storage.googleapis.com/ifrenkel-test1-licenses):

  • check out this branch
  • run bundle exec rails console
  • run the code below
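# Connect to the public test bucket, scoped to the v1/rubygems prefix
conn = Gitlab::PackageMetadata::Connector::Gcp.new('ifrenkel-test1-licenses', 'v1', 'rubygems')
total = 0
# Read only the files that come after this position
checkpoint = Hashie::Mash.new(sequence_id: '1672196094', chunk_id: '00000001.csv')
conn.data_after(checkpoint).each do |file|
  num = 0
  file.each { |_line| num += 1 }
  puts "file: #{file.sequence}/#{file.chunk}, num lines: #{num}"
  # Reach into the wrapper for the underlying GCP file (for inspection only)
  gcp_file = file.instance_variable_get(:@file)
  puts "  gcp url: #{gcp_file.gapi.media_link}"
  total += num
end
puts "total lines: #{total}"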

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #383797 (closed)

Edited by Igor Frenkel
