HPC Jobs
In the context of HPC, a job is a task (a program or script) that you ask the computer to run. Jobs are managed by a scheduler, which accepts submissions from all cluster users and works out when, and on which resources, each job will run.

When you connect to the cluster, you connect to a login node (or master node). Jobs are not run on the login node: the high resource demand of a job would hinder this node's purpose, which is to provide access to your files and accept job submissions. Instead, your jobs should be run on the compute nodes, by submitting a request to the scheduler and allowing it to allocate a compute node (or nodes) for your job and execute the job when the nodes are available.

Jobs take two forms: interactive and non-interactive. Interactive jobs require you to provide input while the job is running. For these jobs, the scheduler allocates a compute node and connects you to an interactive shell on the allocated node. From the scheduler's standpoint, the job is running as long as the shell is open; when the shell is exited, the job is complete. Interactive jobs can also be applications, such as Jupyter Notebook or RStudio, that allow user interaction. Non-interactive jobs can be executed on a compute node without requiring any interaction or input while they run.
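As an illustration, a non-interactive job under the Slurm scheduler (assumed here; other schedulers use different commands) is typically described by a batch script. The script name and program below are hypothetical placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=myjob      # a name for the job
#SBATCH --nodes=1             # number of compute nodes to allocate
#SBATCH --time=00:10:00       # time limit (HH:MM:SS)

# the program or script to run on the allocated compute node
./my_program
```

Submitting this with, for example, sbatch myjob.sh returns immediately; the scheduler runs the script on a compute node once the requested resources are free.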
Interactive Jobs
An interactive job can be run on a compute node using the salloc command. The salloc command obtains a resource allocation from the scheduler and executes a command. For an interactive job, the command passed to salloc is an interactive shell such as bash. Once the command finishes executing (i.e., when the bash shell is exited), the allocated resources are released.
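A minimal sketch of this workflow (option values are illustrative; note that on some Slurm configurations the shell starts on the login node inside the allocation, and sites may recommend srun --pty bash instead):

```shell
# request 1 node for 30 minutes and start bash within the allocation
salloc -N 1 -t 00:30:00 bash

# ... interactive work on the allocated resources ...

# exiting the shell ends the job and releases the allocation
exit
```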
salloc has a number of options that can be used to specify the requested resources and set some properties of the job. Some useful and frequently used options are:
- Set working directory (-D): the remote process will change into this directory before running
- Nodes (-N): sets the number of nodes to be allocated to the job
- Memory required per node (--mem): default unit is megabytes
- Time limit (-t): sets a limit on the run time, e.g. -t 01:30:00; a job that exceeds this limit may be killed
- Name (-J): a name for the job (this makes it easier to spot in logs and scheduler status)
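Putting these options together (all values, including the directory and job name, are illustrative):

```shell
# 2 nodes, 4 GB of memory per node, a 1-hour time limit,
# a working directory of ~/project, and the job name "analysis"
salloc -D ~/project -N 2 --mem=4G -t 01:00:00 -J analysis bash
```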