Fetching of logs fails with 500 Internal Server Error after a certain time
Summary
Fetching logs from the trial UNICORE instance fails with a 500 Internal Server Error once some time has passed since the run was submitted.
Steps to reproduce
- Submit run to trial UNICORE instance
- Fetch status and logs as usual; everything works fine
- After a couple of hours, attempt to fetch logs again -> fails with 500 Internal Server Error
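Jobs seem to become unavailable after some time interval, perhaps because they are deleted. In the traceback below, Slurm rejects the job ID ('Invalid job id specified') when UNICORE asks for the job details.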
Software versions
- Device model:
- OS version:
- Software versions:
- Browser version:
What is the current bug behavior?
500 Internal Server Error
What is the expected correct behavior?
Error response with an informative message
Relevant logs and/or screenshots
LOGS
[2023-10-17 12:06:02 +0000] [36] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app( # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/fastapi/applications.py", line 270, in __call__
    await super().__call__(scope, receive, send)
  File "/venv/lib/python3.11/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 84, in __call__
    await self.app(scope, receive, send)
  File "/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 75, in __call__
    raise exc
  File "/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 64, in __call__
    await self.app(scope, receive, sender)
  File "/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/venv/lib/python3.11/site-packages/starlette/routing.py", line 680, in __call__
    await route.handle(scope, receive, send)
  File "/venv/lib/python3.11/site-packages/starlette/routing.py", line 275, in handle
    await self.app(scope, receive, send)
  File "/venv/lib/python3.11/site-packages/starlette/routing.py", line 65, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/fastapi/routing.py", line 235, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/fastapi/routing.py", line 161, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/mantik_api/routes/rbac/authorization.py", line 109, in wrapper_auth
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/mantik_api/routes/project_routes/runs.py", line 471, in projects_project_id_runs_run_id_logs_get
    return job.get_logs()
           ^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/mantik_api/unicore/job.py", line 76, in get_logs
    unicore_api_logs = self._get_unicore_api_logs()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/mantik_api/unicore/job.py", line 90, in _get_unicore_api_logs
    logs = self.get_properties().logs
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/mantik_api/unicore/job.py", line 54, in get_properties
    id_=self.id, data=self._job.properties, bss_details=self._job.bss_details()
                                                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/pyunicore/client.py", line 501, in bss_details
    return self.transport.get(url=self.links["details"])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/pyunicore/client.py", line 176, in get
    res = self.run_method(requests.get, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/pyunicore/client.py", line 165, in run_method
    self.check_error(res)
  File "/venv/lib/python3.11/site-packages/pyunicore/client.py", line 143, in check_error
    raise requests.HTTPError(msg, response=res)
requests.exceptions.HTTPError: 500 Server Error: Could not get job details: java.lang.Exception: Getting job details on TSI failed: reply was TSI_FAILED: Command 'scontrol show jobid 40' failed with code 1: b'slurm_load_jobs error: Invalid job id specified\n' for url: https://unicore.dev2.cloud.mantik.ai/DEMO-SITE/rest/core/jobs/964f2ca1-7acd-4a3c-83cb-1583e4b96907/details
Possible fixes
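Catch the error response from the UNICORE API (raised by pyunicore as requests.HTTPError, see traceback) and return a dedicated error message instead of letting the exception propagate as a 500.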
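
A rough sketch of what that translation could look like, just to make the idea concrete. Everything here is hypothetical and not taken from mantik_api: the names `ApiError` and `translate_unicore_error`, the choice of 404 for a job the batch system no longer knows, and 502 for any other upstream failure are all assumptions. Only the matched message ('Invalid job id specified') comes from the traceback above.

```python
from dataclasses import dataclass

# Message fragment taken from the traceback above; matching on it is an assumption.
INVALID_JOB_ID_MARKER = "Invalid job id specified"


@dataclass
class ApiError:
    """Hypothetical payload the route could return instead of a bare 500."""

    status_code: int
    detail: str


def translate_unicore_error(
    job_id: str, upstream_status: int, upstream_message: str
) -> ApiError:
    """Map a failed UNICORE request to an informative client-facing error."""
    if INVALID_JOB_ID_MARKER in upstream_message:
        # The batch system no longer knows the job, so its logs cannot be
        # fetched any more. 404 is an assumed choice; 410 would also be arguable.
        return ApiError(
            status_code=404,
            detail=(
                f"Logs for job {job_id} are no longer available: "
                "the batch system no longer knows this job."
            ),
        )
    # Any other upstream failure is passed on as a gateway error (assumed choice).
    return ApiError(
        status_code=502,
        detail=(
            f"UNICORE request for job {job_id} failed "
            f"with status {upstream_status}."
        ),
    )


# Possible use inside the route handler (not executed here):
#
#     try:
#         return job.get_logs()
#     except requests.HTTPError as e:
#         err = translate_unicore_error(job.id, e.response.status_code, str(e))
#         raise fastapi.HTTPException(status_code=err.status_code, detail=err.detail)
```

Keeping the mapping in a small pure function would make it easy to unit test without a running UNICORE instance. Matching on the message text is brittle, though, so a check on the job's existence before fetching details might be preferable if UNICORE offers one.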
/cc @thomas_ambrosys /cc @fabian.emmerich
Edited by Fabian Emmerich