Draft: `GET /api/:version/groups/:id/projects` performance investigation
What does this MR do and why?
Describe in detail what your merge request does and why.
Performance investigation for groups -> projects API endpoint. See #330140 (comment 687343917)
Related to #330140 (closed)
Overview
Copied from Issue comment:
Status
- I've looked into this and gotten some context, and also some good advice/insight from others. I've also re-reviewed the info @mksionek provided on a similar issue in this thread, including her suggestion of watching this excellent video.
- It seems there aren't any obvious n+1 problems (TODO: we should update the issue title to reflect this), but it is a very complex request with a lot going on.
- There seem to be two paths available to improve performance:
- DB: One single slow query which takes up a majority of the request time in some cases (more info in this thread above.
- Note that there are dozens (hundreds?) of other queries being performed as part of some of these requests, but based on initial investigation, few or none of these seem to be slow (at least not longer than 1 second to show up in Kibana), and many of them are repeated and cached.
- CPU (Ruby/Rails): These requests can also spend several seconds in "view duration". Based on this thread above, it seems like the "view duration" for a Grape API call essentially means all "CPU" or non-DB time.
Next Steps
Following the good advice from the video, the next steps should be:
- Get a baseline (benchmark)
- Look for culprits
- Keep benchmarking
So, my specific plans are:
- Get a baseline and benchmark
- Some of the suggested tools only run on linux, so I've set up a Virtualbox linux instance with a clean GDK install for this purpose.
- Then, create enough data to simulate the slowness we see on production. I plan to start with Gosia's example script here, but since this problem is related to nested groups (maybe deeply, maybe widely?), it will need to also be able to create that structure as well.
- Then, we can use the appropriate tools to get a repeatable and reliable benchmark.
- Look for culprits
- As far as the slow DB query, I am not a DBA nor a postgres expert (my recent years of experience are primarily with mysql), and this is a complex query. So, once I have a reproducible benchmarked local environment for the slow query, I will probably reach out to other team members with more DBA experience for any ideas on optimizing this query.
- As far as the CPU/Ruby ("View duration") slowness, I will try the various tools suggested in the video to try to narrow down any bottlenecks in the Ruby code, and see if there are any possible optimizations I can identify.
NOTES
Performance Investigation Talk
- Talks: https://eileencodes.com/speaking/#how-to-performance
- Slides: https://speakerdeck.com/eileencodes/goruco-2015-how-to-performance
Tools
- benchmark-ips - https://github.com/evanphx/benchmark-ips
Steps
NOTE: These plans may change based on information we find, but will not necessarily be kept up to date.
- Set up a linux virtual machine with GDK (some of the tools only work in linux)
- Write a script to create a big hierarchy of groups and projects (TODO: We could have used FactoryBot and
with_hierarchy
for this). - Get an initial benchmark with
benchmark-ips
- Start looking for culprits
- Try to fix any suspicious culprits
- Always Be Benchmarking
- Profit.
Benchmarking Challenges
It's difficult to benchmark a long-running request in a reliable, repeatable way, especially in a local environment, and especially if the dev machine has thermal throttling issues. See more details in this thread: !71246 (comment 690466371)
RubyProf
Note on profiling in dev
In the development environment, there are several things enabled which confound the results. For example, the Bullet N+1 query detector accounts for a significant percentage of the request time. This must be explicitly disabled (with a code hack, there's no ENV var to override in dev) in order to eliminate this from the request profiling.
RubyProf Results
RubyProf Run 01
Setup:
- Local dev environment
- Bullet disabled by forcing
lib/gitlab/bullet.rb#configure_bullet?
to returnfalse
- A single request against a group hierarchy 3 deep, 3 wide, with 2 projects per group, with the following HTTP request (RubyMine HTTP Request Format):
GET http://gdk.test:3000/api/v4/groups/test6/projects?include_subgroups=true
Accept: application/json
Content-Type: application/json
PRIVATE-TOKEN: XXX
log/api_json.log entry:
{"time":"2021-10-01T22:29:38.673Z","severity":"INFO","duration_s":5.61192,"db_duration_s":0.49877,"view_duration_s":5.11315,"status":200,"method":"GET","path":"/api/v4/groups/test6/projects","params":[{"key":"include_subgroups","value":"true"}],"host":"gdk.test","remote_ip":"127.0.0.1","ua":"Apache-HttpClient/4.5.13 (Java/11.0.12)","route":"/api/:version/groups/:id/projects","user_id":1,"username":"root","queue_duration_s":1.921156,"gitaly_calls":20,"gitaly_duration_s":0.906728,"redis_calls":24,"redis_duration_s":0.053672000000000004,"redis_read_bytes":11130,"redis_write_bytes":15512,"redis_cache_calls":23,"redis_cache_duration_s":0.051346,"redis_cache_read_bytes":11130,"redis_cache_write_bytes":15464,"redis_shared_state_calls":1,"redis_shared_state_duration_s":0.002326,"redis_shared_state_write_bytes":48,"db_count":224,"db_write_count":0,"db_cached_count":114,"db_replica_count":0,"db_replica_cached_count":0,"db_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_count":224,"db_primary_cached_count":114,"db_primary_wal_count":0,"db_primary_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":1.164,"cpu_s":5.766931,"pid":79417,"correlation_id":"01FGYZM4C1KG7RCH9CQ1SKZVA5","meta.user":"root","meta.caller_id":"GET /api/:version/groups/:id/projects","meta.remote_ip":"127.0.0.1","meta.feature_category":"subgroups","meta.client_id":"user/1"}
Profile HTML file: callstack-01.html