Fetching run logs, info and status fails or is incorrect
Summary
When fetching logs I receive an Internal Server Error, and when fetching info I receive incorrect info (RUNNING while the run is actually SUCCESSFUL).
Steps to reproduce
- Fetch logs: Internal Server Error
- Fetch info: reports RUNNING, while the MLflow run was successful. So all info might be incorrect/outdated.
- Fetch status: reports FAILED, while the HPC job scheduler as well as MLflow show that the job completed successfully.
Software versions
- Device model:
- OS version:
- Software versions:
- Browser version:
What is the current bug behavior?
What is the expected correct behavior?
- When I fetch the logs, the request succeeds and returns the logs
- When I fetch the info, I get the current and correct info
- The status should be consistent with the UNICORE status
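The consistency expectation above could be checked with something like the following minimal sketch. It is not the service's actual code: `UnicoreStatus` and `find_status_mismatch` are hypothetical names, and the list of states is an assumption (only RUNNING, SUCCESSFUL and FAILED appear in this report; the others are what I believe UNICORE uses). It only compares two status strings and labels the two cases seen here, a stale status versus a conflicting one.

```python
from enum import Enum
from typing import Optional


class UnicoreStatus(Enum):
    """Job states; assumed to match what UNICORE reports (hypothetical helper type)."""
    UNDEFINED = "UNDEFINED"
    READY = "READY"
    STAGINGIN = "STAGINGIN"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    STAGINGOUT = "STAGINGOUT"
    SUCCESSFUL = "SUCCESSFUL"
    FAILED = "FAILED"


# States after which a job no longer changes.
TERMINAL = {UnicoreStatus.SUCCESSFUL, UnicoreStatus.FAILED}


def find_status_mismatch(reported: str, unicore: str) -> Optional[str]:
    """Return a short description of the mismatch, or None if both agree."""
    reported_status = UnicoreStatus(reported)
    unicore_status = UnicoreStatus(unicore)
    if reported_status is unicore_status:
        return None
    # UNICORE already finished but we still report a non-terminal state:
    # most likely outdated info rather than a wrong mapping.
    if unicore_status in TERMINAL and reported_status not in TERMINAL:
        kind = "stale"
    else:
        kind = "conflicting"
    return f"{kind}: reported {reported}, UNICORE says {unicore}"


if __name__ == "__main__":
    # The two cases from this report.
    print(find_status_mismatch("RUNNING", "SUCCESSFUL"))
    print(find_status_mismatch("FAILED", "SUCCESSFUL"))
```

Distinguishing "stale" from "conflicting" may help narrow down the cause: the info case looks like an update that never arrived, while the status case looks like a wrong mapping or an error path being taken.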
Relevant logs and/or screenshots
Check Slack alerts from 2024-05-16 09:00-09:15.
LOGS
<Put log output here>
Possible fixes
/cc @rico.berner