Cached project-based ETAG lookup for GitHub email

What does this MR do and why?

Updates the way we get a GitHub user's email during GitHub project import.

Background

When importing resources such as issues and pull requests from GitHub, we also need to associate each resource with a user on GitLab. This requires a mapping between GitHub and GitLab users. To avoid duplicate API requests, the mapping is cached.

In Shorten cache time for private emails in GitHub... (!70293 - merged), the cache timeout was changed from 24 hours to 15 minutes. With the shorter timeout, imports reach the API rate limit more often. When that happens we retry after an hour, by which time the cached values have expired and have to be fetched again.

So if a user does not have a public email configured, the number of imported resources per hour significantly decreases. It may drop from approximately 500K to around 5K in the worst-case scenario.

This MR

Introduces two new caches so that, if the email is not found on the first API request, we make one ETAG API request per user per project. When a request is made with the ETAG passed as a header and the user on GitHub has not changed, GitHub responds with a 304 Not Modified. If the user resource has changed, it responds with a 200, and the cache storing the email (shared across all projects) is updated.

Because these are conditional requests, a 304 response does not count against the API rate limit.
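
The conditional-request behaviour described above can be sketched as follows. This is a minimal, self-contained Python illustration, not the importer's code: `FakeGitHubUsers`, `get_user`, and the profile fields are hypothetical stand-ins for GitHub's user endpoint, and the ETAG is derived from a hash of the profile purely for demonstration. The only behaviour assumed from the description is that a request whose `If-None-Match` header matches the current ETAG gets a 304 with no body and is not counted against the rate limit.

```python
import hashlib
import json


class FakeGitHubUsers:
    """Hypothetical in-memory stand-in for GitHub's user endpoint."""

    def __init__(self, users):
        self.users = users            # username -> profile dict
        self.rate_limited_calls = 0   # only non-304 responses are counted

    def _etag(self, profile):
        # Assumption: any stable fingerprint of the resource serves as an ETAG here.
        digest = hashlib.sha256(json.dumps(profile, sort_keys=True).encode()).hexdigest()
        return f'"{digest[:16]}"'

    def get_user(self, username, headers=None):
        profile = self.users[username]
        etag = self._etag(profile)
        if headers and headers.get("If-None-Match") == etag:
            # Unchanged resource: 304, no body, not counted.
            return 304, {"ETag": etag}, None
        self.rate_limited_calls += 1
        return 200, {"ETag": etag}, dict(profile)


github = FakeGitHubUsers({"octocat": {"login": "octocat", "email": None}})

# First request: no ETAG is known yet, so it counts against the rate limit.
status, headers, body = github.get_user("octocat")
etag = headers["ETag"]

# Second request: the user is unchanged, so the answer is a 304.
status_unchanged, _, _ = github.get_user("octocat", {"If-None-Match": etag})

# The user sets a public email, so the stored ETAG no longer matches.
github.users["octocat"]["email"] = "octocat@example.com"
status_changed, _, changed_body = github.get_user("octocat", {"If-None-Match": etag})
```

In this sketch only the first and third requests increment `rate_limited_calls`; the 304 in between is free.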

The two new caches are as follows:

  1. github-import/user-finder/etag/<username>: Stores the ETAG from the response from GitHub. One entry per username.
  2. github-import/user-finder/<project.id>/email-fetched/<username>: Stores a 1 if the project has already made an ETAG request. In that case, we don't try again for the project. One entry per project and username.

We check whether the resource has changed when the email from the cache is nil or blank. A blank email indicates that an attempt was made previously but the user did not have an email. In this case, we want to try to get the email once per project in case it has been updated in the meantime.
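
One way to picture how the two caches interact is the sketch below. It is an illustration under stated assumptions, not the code in this MR: the cache is a plain dict with no expiry, `email_key` is a hypothetical name for the existing shared email cache (its real key is not given here), `fetch_user` is a hypothetical callable returning `(status, etag, email)`, and the sketch assumes the project is marked as "email fetched" after any request, conditional or not.

```python
def etag_key(username):
    return f"github-import/user-finder/etag/{username}"


def email_fetched_key(project_id, username):
    return f"github-import/user-finder/{project_id}/email-fetched/{username}"


def email_key(username):
    # Hypothetical key for the email cache shared across all projects.
    return f"example/email/{username}"


def email_for_username(cache, fetch_user, project_id, username):
    """Return the cached or freshly fetched email ("" when the user has none)."""
    email = cache.get(email_key(username))
    if email:
        return email  # a non-blank cached email needs no request

    fetched_key = email_fetched_key(project_id, username)
    if cache.get(fetched_key):
        return email  # this project has already tried once

    etag = cache.get(etag_key(username))  # None on the very first lookup
    status, new_etag, new_email = fetch_user(username, etag)
    cache[fetched_key] = 1
    if status != 304:
        # The user resource is new to us or has changed: refresh both caches.
        email = new_email or ""
        cache[email_key(username)] = email
        cache[etag_key(username)] = new_etag
    return email


# Hypothetical GitHub stub that records the ETAG sent with each request.
calls = []
profile = {"etag": '"v1"', "email": None}


def fetch_user(username, etag):
    calls.append(etag)
    if etag == profile["etag"]:
        return 304, etag, None
    return 200, profile["etag"], profile["email"]


cache = {}
for project_id in (1, 2, 3):
    email_for_username(cache, fetch_user, project_id, "octocat")
email_for_username(cache, fetch_user, 1, "octocat")  # project 1 is marked: no request
calls_without_email = list(calls)

# The user later sets a public email; a new project picks it up via the ETAG request.
profile.update(etag='"v2"', email="octocat@example.com")
email_project_4 = email_for_username(cache, fetch_user, 4, "octocat")
email_project_5 = email_for_username(cache, fetch_user, 5, "octocat")  # served from cache
```

With three projects and no public email, the stub sees one request without an ETAG followed by two with it, which matches the pattern described in the local validation steps.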

Scenarios

(Screenshot: Screenshot_2023-08-24_at_05.38.52)

How often does the conditional request respond with something other than a 304?

A conditional request returns a full (non-304) response if the resource has changed. For users, that means a change to any of the following:

  • Account details: email, username, name, admin?, company, avatar, 2FA enabled, plan
  • Stats: # of repos, # of gists, # of followers and following, # of collaborators
  • Other: bio, location, twitter username, blog

How many API requests are expected?

For every user mapped:

  • Rate limited requests:
    • 1 every 24 hours
  • Non-rate limited requests:
    • Best case: 0 (if the email is found on the first try)
    • Worst case: 1 per project every 24 hours
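
As a rough worked example, the bounds above scale as follows for a 24-hour window. The function name and the figures (200 mapped users, 3 projects) are illustrative assumptions, not measurements from this MR.

```python
def request_bounds_per_day(users, projects):
    """Bounds on user-lookup requests in a 24-hour window, per the list above."""
    return {
        # One rate-limited request per mapped user.
        "rate_limited": users,
        # Best case: every email is found on the first try.
        "non_rate_limited_best": 0,
        # Worst case: one conditional (ETAG) request per user per project.
        "non_rate_limited_worst": users * projects,
    }


bounds = request_bounds_per_day(users=200, projects=3)
```

Only the `rate_limited` figure counts against GitHub's API limit; the conditional requests are free as long as they return a 304.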

How to set up and validate locally

On GitHub:

  1. Create 3 projects with some issues, PRs, comments, etc. assigned to your user.
  2. Set your email to non-public: https://github.com/settings/profile > Public email > remove the email > Update

On GitLab:

  1. Create a user with the same username and email but don't sign in as this user.
  2. Tail the importer logs to track requests made to GitHub's API: tail -f log/importer.log | grep "Fetching email from GitHub"
  3. From the GitHub importer, import the 3 projects within a few seconds of each other.
  4. See that in the logs you have:
    1. One request without ETAG: Fetching email from GitHub for a project id
    2. Two requests with ETAG: Fetching email from GitHub with ETAG header for the remaining two project ids (screenshot: Screenshot_2023-08-24_at_09.43.11)
  5. Verify that the issues, MRs, comments, etc. are assigned to the user who performed the import.

Test with a valid public email

On GitHub:

  1. Set your email to public: https://github.com/settings/profile > Public email > Select the email > Update

On GitLab:

  1. Tail the importer logs to track requests made to GitHub's API: tail -f log/importer.log | grep "Fetching email from GitHub"
  2. From the GitHub importer, re-import the 3 projects within a few seconds of each other.
  3. See that in the logs you have:
    1. Three requests with ETAG: Fetching email from GitHub with ETAG header for all 3 project ids (screenshot: Screenshot_2023-08-24_at_09.57.15)
  4. Verify that the issues, MRs, comments, etc. are now correctly assigned to the matching user.

Optional: Test with email set from the start

On GitLab:

  1. Clear the cached values for the importer.
  2. Tail the importer logs to track requests made to GitHub's API: tail -f log/importer.log | grep "Fetching email from GitHub"
  3. From the GitHub importer, re-import the 3 projects within a few seconds of each other.
  4. See that in the logs you have:
    1. One request without ETAG: Fetching email from GitHub and no further API calls. This means the email was found on the first try and cached. (screenshot: Screenshot_2023-08-24_at_10.02.20)
  5. Verify that the issues, MRs, comments, etc. are correctly assigned to the matching user.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #416308 (closed)

Edited by Madelein van Niekerk
