Quickstart

Cluster Introduction

Faculty and their research students can request access to the cluster by contacting Mike Conner [connerms], Linux System Administrator, Information Technology Services. Access is granted through the Login, or Master, node. The Master node can be used to manage files, prepare jobs, and submit jobs to the scheduler.

The scheduler is software that collects incoming jobs along with their resource requirements, matches those jobs to compute resources, and determines when each job will run. Jobs are submitted to the scheduler using a submission script (or submission file) that includes parameters passed to the scheduler (where to place the output of the job, how long the job is expected to run, how many cores are needed, etc.) and the actual workload for the compute nodes (i.e. the program or script to run). Jobs are placed in a queue and run when the necessary resources become available. Policies configured in the scheduler balance priorities, reduce time spent waiting in the queue, and maximize use of the resources.

To prepare jobs for the scheduler, users write a shell script, the submission script, that sets all necessary variables and contains all commands to be run.
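
As a preview, here is a minimal sketch of what such a script can look like (the job name, resource values, and hostname workload are only illustrative; a complete worked example appears later in this guide):

  #!/bin/bash
  # Scheduler directives: job name, one node, five-minute time limit
  #SBATCH -J ExampleJob
  #SBATCH -N 1
  #SBATCH -t 00:05:00

  # The workload: any program or script you want to run on the compute node
  hostname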

Connecting to the Grinnell High Performance Compute Cluster

After obtaining an account on the cluster, the easiest way to connect, begin setting up your workload, and submit jobs is to direct your browser to https://hpc.grinnell.edu. You'll need to log in with your Grinnell College credentials when prompted.

Open OnDemand gives you easy access to tools to create and submit jobs, manage your files on the cluster, and even open an interactive shell on the cluster. Open OnDemand (web-based) connections are restricted to network connections that are wired, on campus, or connected to the Grinnell College secure wireless network.

Cluster users may also connect to the cluster via SSH. SSH connections are restricted to network connections that are wired, on campus, or connected to the Grinnell College secure wireless network. To SSH to the cluster, open a terminal emulator (such as the Terminal in VS Code, PowerShell on Windows, or the Terminal application on macOS) and enter:

  $ ssh <yourusername>@hpc.grinnell.edu

Then enter your password when prompted.
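
If you connect by SSH frequently, you can optionally add a shortcut entry to your OpenSSH client configuration file, ~/.ssh/config (the alias grinnell-hpc below is just an example name):

  Host grinnell-hpc
    HostName hpc.grinnell.edu
    User <yourusername>

With that entry in place, the connection shortens to ssh grinnell-hpc.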

The SLURM Scheduler

Compute jobs on the cluster are managed by a scheduler: SLURM. To run jobs on the cluster you'll need to prepare your jobs and then submit them to the scheduler, along with some information about the resources needed to run them. (Refer to the SLURM Quickstart guide for more information on using and interacting with the SLURM scheduler.)

Jobs can be submitted to the scheduler using the srun command. srun can be given arguments that set parameters directly, or it can be used with a script that sets parameters and defines the job to be executed. At this time, each node on the Grinnell HPC cluster has two CPUs (sockets) with 10 cores each. To submit a job that requires 16 cores, you would request 1 node, 2 sockets, and 8 cores per socket. For a job that requires 32 cores, you would request 2 nodes, 2 sockets per node, and 8 cores per socket. Alternatively, SLURM also uses the concept of tasks, so rather than specifying sockets per node and cores per socket, you can specify tasks per node: for a 32-core job you could request 2 nodes and 16 tasks per node.
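
As a sketch of the 32-core example above, the two srun resource requests below describe the same job in the two different ways (<program> stands in for whatever executable you want to run):

  $ srun -N 2 --sockets-per-node=2 --cores-per-socket=8 <program>
  $ srun -N 2 --ntasks-per-node=16 <program>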

These are common parameters passed to SLURM when submitting a job (an example combining several of them follows the list):

  • Set Working Directory (-D) The remote process will change into this directory before running
  • Nodes (-N)
  • Sockets (or CPUs) per node (--sockets-per-node)
  • Cores per socket (--cores-per-socket)
  • Tasks per node (--ntasks-per-node)
  • Memory required per node (--mem)
  • Time limit (-t HH:MM:SS) Set a time limit on the runtime - a job that exceeds this limit may be killed
  • Output Path (-o) The path for the output file of the job
  • Error Path (-e) The path for the error file of the job
  • Name (-J) A name for the job
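
For example, a single srun invocation that combines several of these flags might look like the sketch below (the job name, paths, and resource values are placeholders, not site defaults):

  $ srun -J MyJob -N 1 --ntasks-per-node=1 --mem=1G -t 00:10:00 \
      -D /home/<yourusername> -o myjob.out -e myjob.err hostname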

A complete list of parameters that srun accepts is available in the manpage for srun.

Basic job submission

Consider the following command. If you enter it on the command line of a terminal, it will print the date, wait 5 seconds, then print the date again.

  $ date; sleep 5; date

To execute this series of commands on the cluster, you can place them in a short script:

File date.sub:

  #!/bin/bash
  date
  sleep 5
  date

The commands are now in a format that can be submitted to the scheduler to be run as a job on the cluster. But we also need to tell the scheduler what resources are needed. This information can be passed to SLURM directly on the srun command line, or it can be placed in the script itself as #SBATCH directives and submitted with the sbatch command.

The job can be submitted to the scheduler by adding flags to the srun command to request one node and one task per node:

  $ srun -J SleepJob -N 1 --ntasks-per-node=1 ./date.sub &
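
Note that when srun launches the script directly like this, the script file must be executable; if it is not, mark it executable first with chmod:

  $ chmod +x date.sub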

To place the resource request in the script itself, we modify the script to include additional lines for SLURM:

File date.sub:

  #!/bin/bash
  #SBATCH -N 1
  #SBATCH --ntasks-per-node=1
  #SBATCH -J SleepJob
  
  date
  sleep 5
  date

The #SBATCH directives are read by sbatch (srun does not parse them), so the job is now submitted with sbatch:

  $ sbatch date.sub
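
sbatch responds with a line like the one below (the job number shown is only an illustration):

  Submitted batch job 12345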

The sbatch command returns a number for your job, the JOBID. Now that the job is in the queue, you can check the status of the job using the squeue command:

  $ squeue --jobs <JOBID>

-or-, if a job name was specified at submission:

  $ squeue -n <JobName>
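
You can also list all of your own jobs at once using squeue's --user flag:

  $ squeue -u <yourusername>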

Next Steps

A step-by-step tutorial for creating and submitting a job using Open OnDemand is available here.

Comprehensive documentation for SLURM, including various tutorials, is available on the SchedMD website (https://slurm.schedmd.com).