Add `AvoidIoRead` Cop that flags memory intensive IO calls
What does this MR do?
Closes #209922 (closed)
Reading files or IO streams into memory in their entirety (i.e. until EOF) can have adverse effects on memory consumption and GC behavior, so we should avoid doing this for files exceeding, say, a few hundred KB in size.
This cop is somewhat aggressive in that a lot of calls to e.g. File.read
are harmless if the file is small (e.g. a YAML config), so it has a relatively high chance of identifying false positives. Moreover, not all calls to IO.read
can be done line-by-line since sometimes you need the entire structure in memory (e.g when making it available as a Ruby hash.)
To mitigate that I already excluded the following files and usages from being linted by this cop:
-
config
files - any tests or test related files
- any custom scripts
- rake tasks
- packages where we commonly
read
via other types, e.g.Redis
- calls to
read
andreadlines
that take a path based onRails.root
(we can assume these are not massive files.)
On the other hand, the few calls that are true positives could potentially do significant damage to our resource footprint (as explained in the ticket, it's not just the raw memory that's allocated which is problematic, but also the effect that it has on future GC runs.)
I believe the cop detected two cases which seem legit, one of which we already knew was a problem (the legacy importer) and which is being tackled in &2734)
The other match that caught my eye was also in the import module, in lfs_restorer
: https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/import_export/lfs_restorer.rb#L69-79
I'm not sure what this is doing, i.e. whether it's just restoring smallish amounts of metadata or actualy large files, but the fact that it's LFS related might be worth looking at this a bit closer. It also looks like that JSON file is only read for a single attribute, then discarded again.
Alternative approaches to this or using a Cop
in general:
- Merely document this somewhere for now. Pros: least disruptive. Cons: less direct nudge, devs might not know about it or decide to ignore it.
- Only enable it for modules where we know this could get out of hand (such as
lib/import
). Pros: reduces noise in areas where it's unlikely to be a problem. Cons: if we add a new module where it's desirable to run, we will most likely forget to include it. - Patch
File.read[lines]
to perform asize
check first; if thesize
is above a given threshold, fail fast with an error. Pros: simple to implement and would let most cases pass. Cons: It's impossible to know upfront when this might make something fall over, esp. when dynamic data is involved such as with imports.
Please let me know what your thoughts are!