Track failed Service Pings for self-managed instances

Summary

With this issue, we add minimal information about failing Service Pings.

Duration and failure reason could be added in follow-up issues.

Why

We currently have no insight into when a Service Ping from a self-managed instance fails. This can happen for multiple reasons: for example, a single Service Ping metric times out, or the Sidekiq job times out before generation finishes.

Requirements

The Service Ping payload should include extra information stating whether generation failed and, if so, the reason for the failure.

Proposal

Note: This is a proposal. More ideas could come up, and the final solution we implement can be discussed with the team.

Add a new key path dedicated to extra information about Service Ping generation
Inside the new key path, record whether Service Ping generation failed and the reason.
The new key path could be called meta or extra (this is just a suggestion; please add more proposals).
Note that Service Ping filtering is in place, and only metrics that have a metric definition are added to the payload.
The new key path could be added at the very end of Service Ping generation.
Check with the data team whether having this field is compatible with the data processes we have in place.
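
To make the proposal concrete, here is a minimal sketch of adding the key at the very end of generation, after filtering. It is only an illustration: `generate_with_meta`, `filter_payload`, and `DEFINED_KEY_PATHS` are hypothetical names and not existing GitLab code, and the key is called `meta` purely for the example. An in-process `rescue` like this only covers errors raised while the payload is being built; a job that is killed from the outside would need separate handling.

```ruby
# Hypothetical stand-in for the key paths that have a metric definition.
DEFINED_KEY_PATHS = %i[uuid counts].freeze

# Hypothetical filter: keeps only top-level keys with a metric definition.
def filter_payload(payload)
  payload.select { |key, _value| DEFINED_KEY_PATHS.include?(key) }
end

# Builds the payload from the given block, then appends the generation
# metadata last, after filtering, so that the filter does not drop it.
def generate_with_meta
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  payload = {}
  generation = { status: 'success' }

  begin
    payload = filter_payload(yield)
  rescue StandardError => e
    # On failure, send only the metadata together with the reason.
    generation = { status: 'failed', error: e.message }
  end

  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
  generation[:duration] = elapsed.round # total time in seconds
  payload.merge(meta: { generation: generation })
end

# Usage: one successful and one failing generation.
ok = generate_with_meta { { uuid: '0000-0000-0000', counts: { issues: 1000 }, undefined_metric: 1 } }
failed = generate_with_meta { raise 'Sidekiq timeout' }
```

In this sketch the metadata is merged after filtering, which avoids the need for a metric definition for the new key. Whether the real implementation does that or gives the key its own definition is one of the points open for discussion.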

Example payloads with the extra information

Successful Service Ping

{
  uuid: '0000-0000-0000',
  counts: {
    issues: 1000
  },
  meta: { # or extra; the key name is still open
    generation: {
      status: 'success',
      duration: 3600 # total time in seconds; this can be a follow-up issue
    }
  }
}

Broken Service Ping

{
  meta: { # or extra; the key name is still open
    generation: {
      status: 'failed',
      duration: 3600, # total time in seconds; this can be a follow-up issue
      error: 'Sidekiq timeout' # this can be a follow-up issue
    }
  }
}

Edited Nov 10, 2021 by Alina Mihaila