Prometheus client PID is not set correctly
I've noticed that metric corruption is still happening in the staging environment.
My current suspicion is that all workers use the Unicorn master's PID when writing metrics. This is backed by the fact that the data folder contains files for only one PID.
This is corroborated by the source code, which sets `@@pid = Process.pid`, a value that all workers then inherit.
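To illustrate the inheritance, here is a minimal sketch in plain Ruby. `PidCache` and `pids_seen_by_child` are hypothetical names made up for this example, not the client's actual code. It assumes the class variable is assigned once in the parent and that workers are created with `fork`, as Unicorn does.

```ruby
# Hypothetical stand-in for the client class; not the real library code.
class PidCache
  # Assigned once, in whichever process loads this class (the master).
  @@pid = Process.pid

  def self.cached_pid
    @@pid
  end
end

# Fork a child and report the cached PID and the child's real PID
# back to the parent through a pipe.
def pids_seen_by_child
  reader, writer = IO.pipe
  child = fork do
    reader.close
    writer.puts "#{PidCache.cached_pid} #{Process.pid}"
    writer.close
    exit!(0)
  end
  writer.close
  cached, actual = reader.read.split.map(&:to_i)
  reader.close
  Process.wait(child)
  [cached, actual]
end

cached, actual = pids_seen_by_child
puts "cached=#{cached} actual=#{actual} master=#{Process.pid}"
```

The cached value seen in the child matches the parent's PID rather than the child's own, which is consistent with only one PID showing up in the data folder.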
Additionally, this may have surfaced another bug that is either already happening or would have happened: if any metric is accessed in the Unicorn master process, the metric file is created and its path is stored in another class variable, `@@files`. Workers would inherit that path, so multiple processes would access the same file and corrupt it because their internal state differs.
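One possible way to guard against both problems is to compare the remembered PID with `Process.pid` on every access and drop any inherited file paths when they differ. This is only a sketch under that assumption. `FileRegistry`, `path_for`, and the `name_PID.db` naming are hypothetical and are not the library's API.

```ruby
# Hypothetical registry; names are illustrative, not the library's API.
class FileRegistry
  @files = {}
  @owner_pid = Process.pid

  class << self
    # Return the metric file path for the calling process, discarding
    # any state inherited from a parent process across fork.
    def path_for(name, dir = "/tmp")
      reset_if_forked
      @files[name] ||= File.join(dir, "#{name}_#{Process.pid}.db")
    end

    private

    # If the current PID differs from the one that populated the cache,
    # we are in a forked child and must not reuse the parent's paths.
    def reset_if_forked
      return if @owner_pid == Process.pid

      @files = {}
      @owner_pid = Process.pid
    end
  end
end
```

With this approach a path created in the master before forking is never reused by a worker, at the cost of one PID comparison per lookup. Resetting the state from an `after_fork` hook would be an alternative.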
This error could also have affected Sidekiq metrics. Also, the recent measures that silence errors when gathering metrics seem to be working, since staging still functions correctly even with corrupted metrics.
Additionally, I think it would be great to:
- hotpatch staging with the updated code before 9.4.2 and see whether it runs correctly for an extended period, to detect any outstanding problems
- add integration tests to prometheus-mmap-client in which metrics are used in a multiprocess environment.
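As a rough idea of what such a multiprocess test could look like, here is a sketch in plain Ruby. It does not use the prometheus-mmap-client API. `run_workers`, `write_sample`, and `files_after_workers` are hypothetical helpers, and `write_sample` is only a stand-in for recording a metric.

```ruby
require "tmpdir"

# Fork `count` workers, run the given block in each one, and return
# their exit statuses.
def run_workers(count)
  pids = Array.new(count) do
    fork do
      begin
        yield
        exit!(0)
      rescue StandardError
        exit!(1)
      end
    end
  end
  pids.map { |pid| Process.wait2(pid).last }
end

# Stand-in for "record a metric": append to a file named after the
# PID of the process doing the write.
def write_sample(dir)
  File.open(File.join(dir, "metric_#{Process.pid}.db"), "a") { |f| f.puts("1") }
end

# Run the workers against a temporary directory and report whether all
# of them succeeded and how many distinct files they produced.
def files_after_workers(count)
  Dir.mktmpdir do |dir|
    statuses = run_workers(count) { write_sample(dir) }
    [statuses.all?(&:success?), Dir.children(dir).size]
  end
end
```

A real test would replace `write_sample` with actual metric updates and then assert that there is one file per worker and that the aggregated values are correct.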