Track failed Service Pings for self-managed instances
Summary
With this issue, we add minimal information about a failing Service Ping.
The duration and the failure reason could be added in follow-up issues.
Why
We currently have no insight into when a Service Ping from a self-managed instance fails. This can happen for multiple reasons: for example, a Service Ping metric times out, or the Sidekiq job times out before generation finishes.
Requirements
- The Service Ping payload should include extra information indicating whether generation failed and, if so, the reason for the failure.
Proposal
Note: This is a proposal. More ideas could come up, and the final solution we implement can be discussed with the team.
- Add a new key path dedicated to extra information about Service Ping generation
- Inside the new key path, record whether Service Ping generation failed and the reason for the failure
- The new key path could be called meta or extra (this is just a suggestion, please add more proposals)
- Note that Service Ping payloads are filtered: only metrics that have a metric definition are included
- The extra information could be added at the very end of Service Ping generation
- Check with the data team whether having this field is compatible with the data processes we have in place
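The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: `filter_by_definitions` and `generate_with_meta` are hypothetical names, the filtering is reduced to top-level keys, and `meta` is just one of the suggested key path names. The block passed to `generate_with_meta` stands in for the real Service Ping generation.

```ruby
# Simplified, hypothetical stand-in for the existing filtering step: it keeps
# only top-level keys that have a metric definition.
def filter_by_definitions(payload, defined_keys)
  payload.select { |key, _value| defined_keys.include?(key) }
end

# Hypothetical wrapper. The block passed in stands for the real Service Ping
# generation, which is not shown in this issue.
def generate_with_meta(defined_keys)
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  payload = {}
  generation = { status: 'success' }

  begin
    payload = filter_by_definitions(yield, defined_keys)
  rescue StandardError => e
    # Only errors raised inside this process can be recorded here.
    generation = { status: 'failed', error: "#{e.class}: #{e.message}" }
  end

  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
  generation[:duration] = elapsed.round # total time in seconds

  # Attach the extra information last, after filtering, so it is not dropped.
  payload.merge(meta: { generation: generation })
end

ok = generate_with_meta(%i[uuid counts]) do
  { uuid: '0000-0000-0000', counts: { issues: 1000 }, undefined_metric: 1 }
end
failed = generate_with_meta(%i[uuid counts]) { raise 'metric timed out' }
```

Attaching the extra information after the filtering step is one way to reconcile the filtering note with adding the key at the very end. A job that is terminated from outside the process would never reach the `rescue`, so that case would need a different mechanism.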
Example of payload with the extra information
Successful Service Ping
{
  uuid: '0000-0000-0000',
  counts: {
    issues: 1000
  },
  meta: { # or extra
    generation: {
      status: 'success',
      duration: 3600 # total time in seconds; this can be a follow-up issue
    }
  }
}
Broken Service Ping
{
  meta: { # or extra
    generation: {
      status: 'failed',
      duration: 3600, # total time in seconds; this can be a follow-up issue
      error: 'Sidekiq timeout' # this can be a follow-up issue
    }
  }
}
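On the receiving side, the example payloads could be read along these lines. This is a minimal sketch, assuming the key path ends up being called `meta` and that the payload has already been parsed into a hash with symbol keys. `generation_status` is a hypothetical helper, and the `'unknown'` fallback is an assumption for payloads from instances that do not send the extra information.

```ruby
# Hypothetical helper: returns the generation status of a received payload,
# or 'unknown' when the payload carries no extra information.
def generation_status(payload)
  payload.dig(:meta, :generation, :status) || 'unknown'
end

successful = {
  uuid: '0000-0000-0000',
  counts: { issues: 1000 },
  meta: { generation: { status: 'success' } }
}
broken = { meta: { generation: { status: 'failed', error: 'Sidekiq timeout' } } }
without_meta = { uuid: '0000-0000-0000', counts: { issues: 1000 } }

statuses = [successful, broken, without_meta].map { |payload| generation_status(payload) }
```

Handling the missing key explicitly is relevant to the check with the data team, since payloads with and without the new field would likely coexist for some time.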
Edited by Alina Mihaila