Mantik API handles management of runs on SSH-accessible remote systems
Summary
As a machine learning specialist, I want to be able to train my models through an SSH connection so that I can train on more infrastructure.
Acceptance Criteria
- Mantik run can be triggered on a remote system through an SSH connection
Given I have an SSH connection
And the target computer has SLURM configured
When I create a run via mantik UI
And I select the SSH connection
Then the run gets submitted to the compute backend
- Run triggered through SSH has accessible run status
Given I have triggered a run
And the run was configured for SSH
When I fetch the run status
Then I get the status
- Run triggered through SSH has accessible logs
Given I have triggered a run
And the run was configured for SSH
When I fetch the run logs
Then I get the run logs
- Run triggered through SSH can be cancelled
Given I have triggered a run
And the run was configured for SSH
When I cancel the job
Then the job terminates on the remote compute system
- Run triggered through SSH returns job info
Given I have triggered a run
And the run was configured for SSH
When I request the job info
Then I get the job info
- Run triggered through SSH allows downloading files/folders from the working directory
Given I have triggered a run
And the run was configured for SSH
When I request to download a file/folder from the run's working directory
Then I receive the file/folder
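The submit and logs criteria above could map to SLURM CLI calls executed over the SSH connection. The following is a minimal sketch under assumptions that are not part of this issue: the function names are hypothetical, the run is submitted with `sbatch --parsable`, and the job writes to SLURM's default output file `slurm-<jobid>.out` in the working directory.

```python
import shlex


def build_submit_command(script_path: str, working_dir: str) -> str:
    """Build an sbatch call that prints only the job ID (--parsable)."""
    return (
        f"sbatch --parsable --chdir={shlex.quote(working_dir)} "
        f"{shlex.quote(script_path)}"
    )


def parse_job_id(sbatch_output: str) -> str:
    """Extract the job ID from `sbatch --parsable` output.

    The output is either "<jobid>" or "<jobid>;<cluster>".
    """
    job_id = sbatch_output.strip().split(";")[0]
    if not job_id.isdigit():
        raise ValueError(f"Unexpected sbatch output: {sbatch_output!r}")
    return job_id


def build_logs_command(working_dir: str, job_id: str) -> str:
    """Build a command that prints the job's default SLURM output file.

    Assumption: the job script does not override the output path.
    """
    log_path = f"{working_dir.rstrip('/')}/slurm-{job_id}.out"
    return f"cat {shlex.quote(log_path)}"
```

All user-controlled values are passed through `shlex.quote` because the resulting strings are interpreted by a shell on the remote system.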
Testing
- acceptance criteria
- edge cases
Given I have an SSH connection
When I trigger an SSH run
And the target machine does not accept the SSH key
Then I receive a 403 Forbidden error
And the error contains a descriptive message that the SSH key might be wrong.
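The edge case above could be handled by translating a rejected SSH key into an API error with status code 403 and a descriptive message. This is a sketch only: `SSHKeyRejectedError`, `ApiError`, and `to_api_error` are hypothetical names, not existing Mantik API types. With paramiko, the triggering exception would likely be `paramiko.AuthenticationException`.

```python
from dataclasses import dataclass
from http import HTTPStatus


class SSHKeyRejectedError(Exception):
    """Hypothetical error raised when the remote host rejects the SSH key."""


@dataclass(frozen=True)
class ApiError:
    """Hypothetical error payload returned to the API client."""

    status_code: int
    detail: str


def to_api_error(error: Exception, host: str) -> ApiError:
    """Translate a rejected SSH key into a 403 response with a hint."""
    if isinstance(error, SSHKeyRejectedError):
        return ApiError(
            status_code=HTTPStatus.FORBIDDEN.value,
            detail=(
                f"Authentication to {host} failed. "
                "The SSH key of the selected connection might be wrong."
            ),
        )
    # Any other failure is outside the scope of this edge case.
    raise error
```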
Technical Information
There should be no interface change when submitting an "SSH" run. The difference should be that the linked connection is an SSH connection and the backend config contains the target computer's IP/URI.
Submission info fields
- "JobID",
- "State",
- "User",
- "JobName",
- "NodeList",
- "NNodes",
- "Partition",
- "CPUTime",
- "Elapsed",
- "Start",
- "End",
- "ExitCode",
- "WorkDir",
- "ConsumedEnergy",
- "Submit"
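The field names above match SLURM accounting fields, so one way to collect them is `sacct` with an explicit `--format` list. The sketch below assumes that `sacct` is available on the target system and that its pipe-delimited `--parsable2` output is parsed on the API side. The function names are hypothetical.

```python
import shlex

SUBMISSION_INFO_FIELDS = [
    "JobID", "State", "User", "JobName", "NodeList",
    "NNodes", "Partition", "CPUTime", "Elapsed", "Start",
    "End", "ExitCode", "WorkDir", "ConsumedEnergy", "Submit",
]


def build_sacct_command(job_id: str) -> str:
    """Build a sacct call that prints one pipe-delimited line per job."""
    fields = ",".join(SUBMISSION_INFO_FIELDS)
    return (
        f"sacct --jobs={shlex.quote(job_id)} --format={fields} "
        "--parsable2 --noheader --allocations"
    )


def parse_sacct_output(output: str) -> dict[str, str]:
    """Map the first output line onto the submission info fields.

    Limitation: a "|" inside a value (e.g. in the job name) breaks the split.
    """
    lines = output.strip().splitlines()
    if not lines:
        raise ValueError("sacct returned no job record")
    values = lines[0].split("|")
    if len(values) != len(SUBMISSION_INFO_FIELDS):
        raise ValueError(f"Unexpected number of fields: {len(values)}")
    return dict(zip(SUBMISSION_INFO_FIELDS, values))
```

`--allocations` restricts the output to the job allocation itself, so job steps such as `<jobid>.batch` do not appear as extra lines.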
Suggested Implementation
Mantik API
- Extend the Mantik API (trigger run) to forward SSH connection details to the compute backend, as defined by the compute backend: if it uses a different endpoint, POST to that endpoint; otherwise pass the necessary variables.
- Use paramiko to establish SSH connections (or look into https://github.com/talmo/hpctools).
- Implement logs, info, and status querying functionality for an SSH run (Mantik API). Look up how it's done for UNICORE.
- Extend the Run database model correspondingly to support SSH: extend backend_type and add "Remote Compute System with SSH" or something similar.
- Extend the project_routes/_compute_backend.py file to fetch Job from the SSH connection.
- Implement mantik_api/compute_backend.job's JobBase class for an SSH with SLURM job. Check out the UNICORE and FirecREST implementations for inspiration.
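The paramiko and job class items above could be combined roughly as follows. This is a sketch, not the actual JobBase interface: `SSHSlurmJob`, `CommandRunner`, and `make_paramiko_runner` are hypothetical names, and the real base class in mantik_api/compute_backend.job may require different methods. The paramiko calls used (`SSHClient`, `connect`, `exec_command`, `RejectPolicy`) are part of its public API. The runner is injected so the job logic can be tested without a remote system.

```python
import shlex
from typing import Callable

# Hypothetical type: takes a shell command, returns its stdout.
CommandRunner = Callable[[str], str]


def make_paramiko_runner(host: str, username: str, key_path: str) -> CommandRunner:
    """Return a runner that executes each command over a new SSH connection."""
    import paramiko  # third-party; imported lazily so the sketch loads without it

    def run(command: str) -> str:
        client = paramiko.SSHClient()
        client.load_system_host_keys()
        # Refuse unknown hosts instead of silently trusting them.
        client.set_missing_host_key_policy(paramiko.RejectPolicy())
        try:
            client.connect(hostname=host, username=username, key_filename=key_path)
            _, stdout, stderr = client.exec_command(command)
            output = stdout.read().decode()
            error = stderr.read().decode()
            if stdout.channel.recv_exit_status() != 0:
                raise RuntimeError(f"{command!r} failed: {error}")
            return output
        finally:
            client.close()

    return run


class SSHSlurmJob:
    """Hypothetical stand-in for a JobBase subclass for SSH with SLURM."""

    def __init__(self, job_id: str, run_command: CommandRunner) -> None:
        self.job_id = job_id
        self._run = run_command

    def get_status(self) -> str:
        """Return the SLURM state of the job, e.g. "RUNNING" or "COMPLETED"."""
        output = self._run(
            f"sacct --jobs={shlex.quote(self.job_id)} --format=State "
            "--noheader --parsable2 --allocations"
        )
        lines = output.strip().splitlines()
        if not lines:
            raise LookupError(f"No accounting record for job {self.job_id}")
        # The state may carry a suffix such as "CANCELLED by 1000".
        return lines[0].split()[0]

    def cancel(self) -> None:
        """Ask SLURM to terminate the job on the remote system."""
        self._run(f"scancel {shlex.quote(self.job_id)}")
```

A job would then be created with something like `SSHSlurmJob(job_id, make_paramiko_runner(host, username, key_path))`, where the connection values come from the linked SSH connection and the backend config.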
/cc @rico.berner
Edited by Jakub Jagielski