quickstart

Cluster Introduction

Faculty and their research students can request access to the cluster by contacting Mike Conner [connerms], Linux System Administrator, Information Technology Services. Access is granted through the Login, or Master, node. The Master node can be used to manage files, prepare jobs, and submit jobs to the scheduler. The scheduler is software that takes all the incoming jobs with their various requirements, matches those jobs to compute resources, and determines when the jobs will run. Jobs are submitted to the scheduler using a submission script (or submission file) that includes parameters passed to the scheduler (where to place the job's output, how long the job is expected to run, how many cores are needed, etc.) and the actual workload for the compute nodes (i.e., the program or script to run). Jobs are placed in a queue and run when the necessary resources become available. Policies configured in the scheduler also balance priorities, reduce time spent waiting in the queue, and maximize the use of the resources.

To prepare jobs for the scheduler, users write a shell script, the submission script, that sets all necessary variables and contains all commands to be run.

Connecting to the Grinnell High Performance Compute Cluster

After obtaining an account on the cluster, the easiest way to connect to the cluster and begin setting up your workload and submitting jobs is to point your browser to https://hpc.grinnell.edu. You'll need to log in when prompted using your Grinnell College credentials.

Open OnDemand will give you easy access to tools to create and submit jobs, manage your files on the cluster, and even access an interactive shell on the cluster. Open OnDemand (web-based) connections are restricted to network connections that are wired, on campus, or connected to the Grinnell College secure wireless network.

Cluster users may also connect to the cluster via SSH. SSH connections are restricted to network connections that are wired, on campus, or connected to the Grinnell College secure wireless network. To SSH to the cluster, open a terminal emulator (such as the Terminal in VS Code, PowerShell on Windows, or the Terminal application on macOS) and enter:

  $ ssh <yourusername>@hpc.grinnell.edu

Then enter your password when prompted.
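If you connect frequently, you can shorten the command with a host alias in your SSH client configuration. This is a minimal sketch: the alias name `hpc` and the username are placeholders you would replace with your own.

```
# ~/.ssh/config -- "hpc" is a hypothetical alias; use your own username
Host hpc
    HostName hpc.grinnell.edu
    User yourusername
```

With this in place, `ssh hpc` is equivalent to the longer command above.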

The SLURM Scheduler

Compute jobs on the cluster are managed by a scheduler: SLURM. To run jobs on the cluster you'll need to prepare your jobs then submit them to the scheduler along with some information about the resources that are needed to run the job. (Refer to the SLURM Quickstart guide for more information on using and interacting with the SLURM scheduler.)

Jobs can be submitted to the scheduler using the `srun` command (to run a job interactively) or the `sbatch` command (to queue a batch script). Either can be given arguments that set parameters directly, or the parameters can be set in the submission script itself.

At this time, each node on the Grinnell HPC cluster has two CPUs (sockets) with 10 cores each. To run a job that requires 16 cores, you would submit a job that requests 1 node, 2 sockets, and 8 cores per socket. To run a job that requires 32 cores, you would request 2 nodes, 2 sockets per node, and 8 cores per socket. Alternatively, SLURM also uses the concept of tasks, so rather than specifying sockets per node and cores per socket, you can specify tasks per node: for a 32-core job you could request 2 nodes and 16 tasks per node.
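As a sketch, the two equivalent 32-core requests described above could appear as `#SBATCH` directives at the top of a submission script (the `echo` line is a placeholder standing in for the real workload):

```shell
#!/bin/bash
# Topology-explicit request: 2 nodes x 2 sockets/node x 8 cores/socket = 32 cores
#SBATCH -N 2
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=8

# Equivalent task-based request (use instead of the two directives above):
# #SBATCH --ntasks-per-node=16

# Placeholder workload
echo "requested $((2 * 2 * 8)) cores"
```

Either form describes the same total allocation; the task-based form is often simpler when you don't care how the cores are spread across sockets.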

These are common parameters passed to SLURM when submitting a job:

  • Working directory (-D) The remote process changes into this directory before running
  • Nodes (-N)
  • Sockets per node (--sockets-per-node)
  • Cores per socket (--cores-per-socket)
  • Tasks per node (--ntasks-per-node)
  • Memory required per node (--mem)
  • Time limit (-t HH:MM:SS) Set a time limit on the runtime - a job that exceeds this limit may be killed
  • Output path (-o) The path for the output file of the job
  • Error path (-e) The path for the error file of the job
  • Name (-J) A name for the job

A complete list of parameters that `srun` accepts is available in the `srun` manpage.
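Several of the common parameters above can be combined in a submission script header; this sketch shows one possible layout (the job name, paths, and workload line are placeholders):

```shell
#!/bin/bash
#SBATCH -J example-job           # job name (placeholder)
#SBATCH -D /home/yourusername    # working directory (placeholder path)
#SBATCH -N 1                     # one node
#SBATCH --ntasks-per-node=1      # one task
#SBATCH --mem=1G                 # 1 GB of memory per node
#SBATCH -t 00:10:00              # ten-minute time limit
#SBATCH -o example-job.out       # standard output file
#SBATCH -e example-job.err       # standard error file

# Placeholder workload
echo "job body runs here"
```

The `#SBATCH` lines are ordinary shell comments, so the script also runs unchanged outside the scheduler.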

As an example, consider this command. It will print the date, wait 5 seconds, then print the date again.

  $ date; sleep 5; date;

To execute this job on the cluster, we would turn this command into a very simple script:

File `date.sub`:

  #!/bin/bash
  date
  sleep 5
  date

Once the script is created it can be submitted to the scheduler. But we also need to tell the scheduler what resources are needed and how long the job will likely take. This information can be passed to SLURM directly as flags on the `sbatch` command line, or it can be placed in the script itself.

The job can be submitted to the scheduler by adding flags to the `sbatch` command to request one node, one task, and a time limit (i.e. walltime) of one minute:

  $ sbatch -N 1 -n 1 -t 00:01:00 date.sub

To place this request in the script itself, we modify the script to include additional lines with the resource request:

File `date.sub`:

  #!/bin/bash
  #SBATCH -N 1
  #SBATCH -n 1
  #SBATCH -t 00:01:00
  
  date
  sleep 5
  date

Then the job is submitted with `sbatch`:

  $ sbatch date.sub

The sbatch command will return a number for your job, the JOBID. Now that the job is in the queue, you can check the status of the job using the `squeue` command or the `scontrol` command:

  $ squeue

-or-

  $ scontrol show job JOBID
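On a busy cluster the queue listing can be long. Two other standard SLURM commands are useful here: `squeue` accepts a user filter to show only your own jobs, and `scancel` removes a queued or running job (JOBID is the number returned at submission):

```
$ squeue -u $USER     # show only your jobs
$ scancel JOBID       # cancel a job
```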

Refer to the SLURM documentation for a complete list of parameters that can be passed to the scheduler.