HPC Jobs
In the context of HPC, a job is a task (a program or script) that you ask the computer to run. Jobs are managed by a scheduler, which accepts submissions from all cluster users and works out when, and on which resources, each job will run.

When you connect to the cluster, you connect to a login node or master node. Jobs are not run on the login node: the high resource demand of a job would hinder this node's purpose, which is to provide access to your files and to accept job submissions. Instead, your jobs should be run on the compute nodes. You submit a request to the scheduler, which allocates a compute node (or nodes) for your job and executes the job when those nodes are available.

Jobs can take two forms: interactive and non-interactive. Interactive jobs require you to provide input while the job is running. For these jobs, the scheduler allocates a compute node and connects you to an interactive shell on the allocated node. From the scheduler's standpoint, the job is running as long as the shell is open; when the shell is exited, the job is complete. Interactive jobs can also be applications, like Jupyter Notebook or RStudio, that allow user interaction. Typically, interactive jobs are used to experiment and to test scripts and workflows.

Non-interactive jobs can be executed on a compute node without any interaction or input while the job is running. Generally speaking, your goal is to design jobs that can reliably run non-interactively, so they can be submitted to the scheduler and you can return to the output when the job is finished, which may be hours, days, or weeks later.
Interactive Jobs
An interactive job can be run on a compute node using the salloc command. The salloc command obtains a resource allocation from the scheduler and executes a command. For an interactive job, the command that is passed to salloc to execute is an interactive shell like bash. Once the command is finished executing (i.e. when the bash shell is exited) the allocated resources are released.
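As a minimal sketch (the resource values here are illustrative placeholders, adjust them for your cluster and workload), an interactive session might look like:

```shell
# Request one task for one hour and run an interactive bash shell
# on the allocation (values are examples, not recommendations):
salloc -n 1 -t 01:00:00 /bin/bash

# ... work interactively on the allocated resources ...

# Exiting the shell ends the command salloc is running,
# which releases the allocated resources:
exit
```

Note that exact behaviour can vary with the cluster's SLURM configuration; on some systems salloc starts a shell for you automatically when no command is given.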
salloc has a number of options that can be used to specify the requested resources and set some properties of the job. Some useful and frequently used options are:
- Set working directory (-D): the remote process will change into this directory before running
- Nodes (-N): sets the number of nodes to be allocated to the job
- Tasks (-n): specifies the maximum number of tasks that job steps (discrete commands in the shell) will run
- Memory required per node (--mem): the default unit is megabytes
- Time limit (-t days-HH:MM:SS): sets a time limit on the runtime; a job that exceeds this limit may be killed
- Name (-J): a name for the job, which makes it easier to spot in logs and scheduler status queries
Example command requesting 2 nodes for a job you anticipate will take 4 hours and where you plan to run a script that will require 8 tasks:
$ salloc -N 2 -n 8 -t 04:00:00
The execution environment allocated by this command includes 2 nodes, with the 8 tasks (by default, one CPU core each) distributed across those nodes.
Commands issued within this shell will still be executed on the master node unless invoked with the SLURM command srun. If you issue the srun command within this execution environment, SLURM will distribute the tasks invoked by the srun command across these resources. Each invocation of the srun command is a job step within the salloc job.
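To illustrate the difference (a sketch run inside the shell obtained from the salloc command above; the output you see will depend on your cluster's node names):

```shell
# Without srun, the command runs once, on the node where the shell
# itself is running (the login/master node):
hostname

# With srun, SLURM launches one instance per task -- 8 in this
# allocation -- distributed across the 2 allocated compute nodes.
# Each srun invocation is recorded as a separate job step:
srun hostname
```

The first command prints a single hostname, while the second prints one line per task, showing that the work was dispatched to the allocated compute nodes.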