Add package metadata ingestion service
What does this MR do and why?
Adds a service for inserting package metadata into the Rails database.
Related to: #383723 (closed)
This issue is part of Sync Rails backend with License DB (&9349 - closed) which is in turn a sub-epic of Replace license-finder MVC (&8072 - closed).
PackageMetadata::SyncService is responsible for calling this service with a list of data to be imported into the database's pm_ tables. This service is responsible for performing bulk upserts in an idempotent way. It uses BulkInsertableTask for this change.
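To illustrate what "idempotent" means here, the following is a minimal plain-Ruby sketch, not the code in this MR. `FakeTable` is a hypothetical in-memory stand-in for a table with a unique index. Its `upsert_all` method only imitates the `INSERT ... ON CONFLICT ... DO UPDATE ... RETURNING "id"` statements shown in the console output below.

```ruby
# Hypothetical in-memory stand-in for a pm_ table with a unique index.
# This is an illustration only; it is not BulkInsertableTask or ActiveRecord.
class FakeTable
  attr_reader :rows

  def initialize(unique_by:)
    @unique_by = unique_by # columns that form the unique key
    @rows = {}             # unique key values => row hash
    @next_id = 1
  end

  # Imitates an upsert: insert new rows, touch updated_at on existing ones,
  # and return the id of every row (inserted or already present).
  def upsert_all(records, now: Time.now)
    records.map do |attrs|
      key = attrs.values_at(*@unique_by)
      row = @rows[key]
      if row
        row[:updated_at] = now
      else
        row = attrs.merge(id: @next_id, created_at: now, updated_at: now)
        @next_id += 1
        @rows[key] = row
      end
      row[:id]
    end
  end
end

packages = FakeTable.new(unique_by: %i[purl_type name])
data = [{ purl_type: 1, name: 'foo' }, { purl_type: 1, name: 'bar' }]

first_ids = packages.upsert_all(data)
second_ids = packages.upsert_all(data) # same input again: no new rows
```

Running the same batch twice leaves the row count unchanged and returns the same ids, which is the property the ingestion service relies on when it is re-run with overlapping data.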
Note
The tables for the models being populated are currently empty. This is the MR that adds the functionality to populate them.
How to test this MR
Run the service in the console and confirm that all dependent models are updated. The inserts should run inside a transaction.
bundle exec rails c
PackageMetadata::Ingestion::IngestionService.execute([PackageMetadata::DataObject.new('package-1','v1.0.0','Apache license','composer')])
The output should look something like this:
[1] pry(main)> PackageMetadata::Ingestion::IngestionService.execute([PackageMetadata::DataObject.new('foo','v1','mit','composer')])
TRANSACTION (0.1ms) BEGIN /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/lib/gitlab/database/schema_cache_with_renamed_table.rb:25:in `columns'*/
#<Class:0x0000000130feda90> Upsert (0.3ms) INSERT INTO "pm_packages" ("purl_type","name","created_at","updated_at") VALUES (1, 'foo', '2023-02-06 19:06:30.374401', '2023-02-06 19:06:30.374401') ON CONFLICT ("purl_type","name") DO UPDATE SET "updated_at"=excluded."updated_at" RETURNING "id","purl_type","name" /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/app/models/concerns/bulk_insert_safe.rb:163:in `block (2 levels) in _bulk_insert_all!'*/
#<Class:0x00000001301cffd0> Upsert (0.2ms) INSERT INTO "pm_package_versions" ("pm_package_id","version","created_at","updated_at") VALUES (697123, 'v1', '2023-02-06 19:06:30.469499', '2023-02-06 19:06:30.469499') ON CONFLICT ("pm_package_id","version") DO UPDATE SET "updated_at"=excluded."updated_at" RETURNING "id","pm_package_id","version" /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/app/models/concerns/bulk_insert_safe.rb:163:in `block (2 levels) in _bulk_insert_all!'*/
#<Class:0x0000000135373e90> Upsert (0.2ms) INSERT INTO "pm_licenses" ("spdx_identifier","created_at","updated_at") VALUES ('mit', '2023-02-06 19:06:30.476160', '2023-02-06 19:06:30.476160') ON CONFLICT ("spdx_identifier") DO UPDATE SET "updated_at"=excluded."updated_at" RETURNING "id","spdx_identifier" /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/app/models/concerns/bulk_insert_safe.rb:163:in `block (2 levels) in _bulk_insert_all!'*/
#<Class:0x000000013541cae0> Upsert (0.2ms) INSERT INTO "pm_package_version_licenses" ("pm_package_version_id","pm_license_id","created_at","updated_at") VALUES (3263720, 1138230, '2023-02-06 19:06:30.481580', '2023-02-06 19:06:30.481580') ON CONFLICT ("pm_package_version_id","pm_license_id") DO UPDATE SET "updated_at"=excluded."updated_at" /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/app/models/concerns/bulk_insert_safe.rb:163:in `block (2 levels) in _bulk_insert_all!'*/
TRANSACTION (0.1ms) COMMIT /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/lib/gitlab/database.rb:375:in `commit'*/
Some notable changes
Migration to add timestamps to pm_ tables
There are several data migrations to the underlying model tables to add timestamps. This has to do with the underlying ActiveRecord insert_all functionality converting on_conflict update calls to on conflict ignore. This is a peculiarity of ActiveRecord::InsertAll, and after several attempts to work around it, we decided to bite the bullet and add timestamps. More information here: !108600 (comment 1234620616)
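The sketch below is a simplified plain-Ruby model of this behavior, not ActiveRecord's implementation. It assumes that the fallback to "ignore" happens when every inserted column is part of the unique key, so nothing is left to update. `conflict_clause` is a hypothetical helper written for this illustration.

```ruby
# Simplified model of how the conflict clause is chosen.
# Assumption: with no updatable columns left, the upsert degrades to an ignore.
def conflict_clause(columns:, unique_by:)
  updatable = columns - unique_by
  target = unique_by.map { |c| %("#{c}") }.join(',')

  if updatable.empty?
    # Nothing to update: conflicting rows are skipped.
    "ON CONFLICT (#{target}) DO NOTHING"
  else
    sets = updatable.map { |c| %("#{c}"=excluded."#{c}") }.join(',')
    "ON CONFLICT (#{target}) DO UPDATE SET #{sets}"
  end
end

# pm_packages before the timestamp migration: only unique-key columns.
without_timestamps = conflict_clause(
  columns: %w[purl_type name],
  unique_by: %w[purl_type name]
)

# With updated_at available there is a column left to update.
with_timestamps = conflict_clause(
  columns: %w[purl_type name updated_at],
  unique_by: %w[purl_type name]
)

puts without_timestamps
puts with_timestamps
```

Under this assumption, adding `updated_at` gives the statement something to set, which matches the `DO UPDATE SET "updated_at"=excluded."updated_at"` clauses in the console output above. In PostgreSQL, a `RETURNING` clause yields no row for a record skipped by `DO NOTHING`, so an update clause is also what lets the ids of existing rows come back.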
Migration to add id column
There is also a change to add an id column to pm_package_version_licenses. Because Rails doesn't recognize compound primary keys, things become more difficult than they have to be (especially in factories). For example, calling PackageVersionLicense.create without a single primary key column causes an error because Rails assumes that there is always a single primary key column to return (generating a RETURNING id clause).
Migration to change default of null on column
The pm_package_version_licenses table had a default: null column on pm_package_id, which is incorrect for a join table. It also prevents upserts from working properly.
has_many relationships added to several models
These associations were not previously defined. They are not needed for bulk inserts, but they ensure that factories work correctly in specs (for example, that associations resolve).
Change to factories and to ee/spec/lib/gitlab/license_scanning/package_licenses_spec.rb
The factories are refactored to make them more generic so that they can be used by specs in this MR and by package_licenses_spec.rb.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.