Slurm is an open-source system for managing and scheduling jobs on Linux clusters. It is fault tolerant and highly scalable, making it suitable for clusters of all sizes.
When Slurm is deployed on a cluster, it can perform these tasks:
- Allocate exclusive or non-exclusive access to compute nodes for users for some duration of time
- Provide a framework for starting, executing, and monitoring work (typically parallel jobs) on the allocated nodes
- Arbitrate contention for resources by managing a queue of pending work
Include #SBATCH directives in your batch scripts to tell Slurm what resources to allocate. For example, this directive requests 128 CPUs for each task:
#SBATCH --cpus-per-task=128
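Most #SBATCH directives can also be given as command-line options to sbatch, where they override the values inside the script. A quick sketch, assuming the script is saved as job.sh (a hypothetical name):

# Override the in-script CPU request at submission time
sbatch --cpus-per-task=64 job.sh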
Commonly used #SBATCH directives (the values below are illustrative; partition names are site-specific):
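#SBATCH -J myjob                # Job name (long form: --job-name)
#SBATCH -N 2                    # Number of nodes (--nodes)
#SBATCH -n 8                    # Total number of tasks (--ntasks)
#SBATCH --ntasks-per-node=4     # Tasks to launch on each node
#SBATCH --cpus-per-task=16      # CPUs allocated to each task
#SBATCH -t 01:30:00             # Wall-time limit, hh:mm:ss (--time)
#SBATCH -p compute              # Partition to submit to (--partition)
#SBATCH --mem=64G               # Memory per node
#SBATCH --exclusive             # Do not permit jobs to share nodes
#SBATCH --output=out_%j.out     # Standard output file (%j expands to the job ID)
#SBATCH --error=error_%j.err    # Standard error file
#SBATCH --array=1-4             # Submit a job array of 4 tasks
#SBATCH --mail-type=END,FAIL    # Events that trigger email notification
#SBATCH --mail-user=you@example.com  # Address for notifications

A basic single-job template combining several of these directives: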
#!/bin/bash
##----------------------------------------------------------
#SBATCH -J Job1                 # Job name
#SBATCH -N 1                    # Total number of nodes requested
#SBATCH --exclusive             # Do not permit jobs to share nodes
#SBATCH --error=error_%j.err    # Error file
#SBATCH --output=out_%j.out     # Standard output file
##----------------------------------------------------------
## insert required modules here
##----------------------------------------------------------
# Directory to store output
cd /mnt/hyperion/insert path to desired directory
# Executable
srun /insert path to executable here
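To run the template, submit it with sbatch and monitor it with squeue. A minimal sketch, assuming the script is saved as job.sh (a hypothetical name) and Slurm assigns job ID 12345:

sbatch job.sh          # Prints "Submitted batch job 12345"
squeue -u $USER        # Check its state (PD = pending, R = running)
cat out_12345.out      # Inspect standard output once the job completes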
To run several jobs at once, each on its own node, use a job array:
#!/bin/bash
##----------------------------------------------------------
## Testing Event 33 with 1 Node Each
## Kyra M. Bryant
## March 27th, 2022
##----------------------------------------------------------
#SBATCH -J Historical_Event     # Job name
#SBATCH -N 1                    # Total number of nodes requested per array task
#SBATCH --exclusive             # Do not permit jobs to share nodes
#SBATCH --error=error_%a.err    # Error file (%a expands to the array task ID)
#SBATCH --output=out_%a.out     # Output file
#SBATCH --array=1-4             # 4 jobs total
##----------------------------------------------------------
parfile_array[1]=parfile_1.par  # Job 1
parfile_array[2]=parfile_2.par  # Job 2
parfile_array[3]=parfile_3.par  # Job 3
parfile_array[4]=parfile_4.par  # Job 4
cd /mnt/hyperion/data/fathom/teaching/historical/event_33
srun /executable path ${parfile_array[$SLURM_ARRAY_TASK_ID]}
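Each array task runs the same script with a different value of $SLURM_ARRAY_TASK_ID, so every task picks up its own parameter file. A sketch of submitting and managing the array, assuming the script is saved as array_job.sh (a hypothetical name) and Slurm assigns job ID 12345:

sbatch array_job.sh    # Submits array tasks 12345_1 through 12345_4
squeue -u $USER        # Pending tasks appear collapsed as 12345_[1-4]
scancel 12345_3        # Cancel only array task 3, leaving the rest running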
Commonly used Slurm commands (the job ID 12345 and script name job.sh below are illustrative):
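sbatch job.sh            # Submit a batch script; prints the assigned job ID
squeue -u $USER          # List your pending and running jobs
scancel 12345            # Cancel job 12345
sinfo                    # Show partition and node availability
sacct -j 12345           # Show accounting data for a completed job
scontrol show job 12345  # Show detailed information about a pending or running job
srun --pty bash          # Request an interactive shell on a compute node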