Running Jobs with Slurm
Overview
Teaching: 25 min
Exercises: 30 min
Questions
How do I run a job with the Slurm scheduler?
Objectives
Understand how to submit an interactive job using Slurm
Understand how to submit a batch job using Slurm
Understand how to set the parameters for your Slurm job
Understand the concept of job arrays
Know how to get email alerts from Slurm
Working with the scheduler
The scheduler is responsible for listening to your job requests, then finding the proper compute node that meets your job’s resource requirements – RAM, number of cores, time, etc. It dispatches the job to a compute node, collects info about the completed work, and stores information about your job. If you’ve asked it to do so, it will even notify you about the status of your job (e.g. begin, end, fail, etc).
Running interactive jobs
There are two ways to run jobs on a cluster. One, usually done at the start, is to get an interactive/foreground session on a compute node. This will give you a command prompt on a compute node and let you run the commands of your choice there.
If you want to experiment with some code and test it, you should run it this way. **Don’t run jobs on the login nodes** – the two 40-core login nodes are dedicated to letting users connect and compile their software; their resources are negligible compared to the 5,000 CPU cores on the compute nodes.
To get an interactive session, you first need to issue a salloc
command to reserve some resources.
salloc --nodes=1 --account=scwXXXX --reservation=scwXXXX_YY --partition=development
(You will need to replace XXXX
to match the account ID and YY
to match the reservation ID given by your instructor.)
The --partition=development option will launch your job in the development partition, which has a 30-minute time limit. This partition is only available on Sunbird in Swansea;
on Hawk in Cardiff use --partition=dev instead.
The salloc command will now respond with a job ID number.
salloc: Granted job allocation 21712
salloc: Waiting for resource configuration
salloc: Nodes scs0018 are ready for job
Accounts and Reservations
We can optionally specify an account and reservation ID to Slurm. The account ID tells the system which project your job will be accounted against; if you are a member of multiple projects, some might have different priorities and limitations.
A reservation is where some compute nodes have been reserved for a particular project at a particular time. To ensure nodes are available for this course we may have obtained a reservation. Your instructor will tell you which account and reservation to use here. The account can be specified through the --account option to salloc (and to the sbatch command which we’ll use soon) and the reservation through the --reservation option. Alternatively, these can be specified in the SALLOC_ACCOUNT, SBATCH_ACCOUNT, SALLOC_RESERVATION and SBATCH_RESERVATION environment variables.
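For example, if you do not want to type the account and reservation every time, you could set the environment variables once in your shell session. This is a minimal sketch; replace the scwXXXX and YY values with the codes given by your instructor:
export SALLOC_ACCOUNT=scwXXXX
export SBATCH_ACCOUNT=scwXXXX
export SALLOC_RESERVATION=scwXXXX_YY
export SBATCH_RESERVATION=scwXXXX_YY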
We have now allocated ourselves a host to run a program on. The --nodes=1 option tells Slurm how many nodes we want. We could also pass -n (or --ntasks) to say how many copies of a task we will be running, and --ntasks-per-node to say how many of those tasks should run on each node we are allocated. We could increase either of these numbers if we want to run multiple copies of a task and if we want to run more than one copy per node.
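As an illustration only (not needed for this course), a request for two nodes running four tasks in total, two per node, might look like the following sketch; keep the account, reservation and partition options from before:
salloc --nodes=2 --ntasks=4 --ntasks-per-node=2 --account=scwXXXX --partition=development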
The --account option tells Slurm which project to account your usage against. If you are only a member of one project then this will default to that project. If you’re a member of multiple projects then you must specify this; if you do not, the job will fail to submit and you will receive an error reminding you to set it. The accounting information is used to measure what resources a project has consumed and to prioritise its use, so it’s important to choose the right project.
To ensure nodes are available for this training workshop a reservation
may have been made to prevent anyone else using a few nodes. In order
to make use of these you must use the --reservation
option too; if
you don’t then you’ll have to wait in the same queue as everyone else.
To actually run a command we now need to issue the srun command. This also takes a -n parameter to tell Slurm how many copies of the job to run, and it takes the name of the program to run. To run a job interactively we need another argument: --pty.
srun --pty /bin/bash
If you run the command above you will see the hostname in the prompt change to the name of the compute node that Slurm has allocated to you. In the example below the compute node is called scs0018.
[s.jane.doe@sl1 ~]$ srun --pty /bin/bash
[s.jane.doe@scs0018 ~]$
We are now logged into a compute node and can run any commands we wish; these will run on the compute node instead of the login node. You can also confirm the name of the host that you are connected to by running the hostname command.
hostname
scs0018
Once we are done working with the compute node we need to disconnect
from it. The exit command will exit the bash program on the compute node, causing us to disconnect. After this command is issued the hostname in the prompt should change back to the login node’s name (e.g. sl1 or cl1).
[s.jane.doe@scs0018 ~]$ exit
exit
[s.jane.doe@sl1 ~]$
At this point we still hold an allocation for a node, and could run
another job on it if we wished. We can confirm this by examining the
queue of our jobs with the squeue
command. By default this will show
all users’ jobs, which is a bit overwhelming, so we can use --user=
and our username to filter to our own jobs only.
squeue --user=s.jane.doe
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
21712 compute bash s.jane.doe R 3:58 1 scs0018
To relinquish the node allocation we need to issue another exit command.
[s.jane.doe@sl1 ~]$ exit
This will display a message that the job allocation is being relinquished and show us the same job ID number again.
exit
salloc: Relinquishing job allocation 21712
At this point our job is complete and we no longer hold any allocations. We can confirm this again with the squeue
command.
[s.jane.doe@sl1 ~]$ squeue --user=s.jane.doe
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Running batch jobs
For most situations we won’t want to run a job interactively; instead we will want to submit it to the cluster, have it do its work, and return any output files to us. We might submit many copies of the same job with different parameters in this way. This method of working is known as batch processing. To do this we must first write some details about our job into a script file. Use nano
(or your favourite command line text editor) and create a file containing the following:
nano batchjob.sh
#!/bin/bash --login
###
# job name
#SBATCH --job-name=hostname
# job stdout file
#SBATCH --output=hostname.out.%J
# job stderr file
#SBATCH --error=hostname.err.%J
# maximum job time in D-HH:MM
#SBATCH --time=0-00:01
# maximum memory of 10 megabytes
#SBATCH --mem-per-cpu=10
# run a single task, using a single CPU core
#SBATCH --ntasks=1
# specify our current project
# change this for your own work
#SBATCH --account=scwXXXX
# specify the reservation we have for the training workshop
# replace XXXX and YY with the codes provided by your instructor
# remove this for your own work
#SBATCH --reservation=scwXXXX_YY
# specify the development partition; this will give our job a maximum of 30 minutes to run
#SBATCH --partition=development
###
/bin/hostname
This is actually a bash script file containing all the commands that will be run. Lines beginning with a # are comments which bash will ignore. However, lines that begin with #SBATCH are instructions for the sbatch program. The first of these (#SBATCH --job-name=hostname) tells sbatch the name of the job; in this case we will call the job hostname. The --output line tells sbatch where output from the program should be sent; the %J in its name means the job number. Including the job number in the output filename means that repeated runs of the same script won’t overwrite the output file. The same applies for the --error line, except here it is for error messages that the program might generate; in most cases this file will be blank. The --time line limits how long the job can run for; this is specified in days, hours and minutes (D-HH:MM). The --mem-per-cpu line tells Slurm how much memory to allow the job to use on each CPU it runs on; if the job exceeds this limit Slurm will automatically stop it. You can set this to zero for no limit, but by putting in a sensible number you help allow other jobs to run on the same node. --account and --reservation tell Slurm to count usage against the project created for training workshops, and to use the nodes that have been reserved for today’s training. (The value after --reservation changes for each workshop; your instructor will give you a value to use.) The final line specifies the actual command which will be executed, in this case the hostname command, which will tell us the name of the compute node that ran our job.
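To illustrate the formats (the values here are chosen purely as an example), a job allowed two and a half hours and 2 gigabytes of memory per core would use directives like these; note that --mem-per-cpu is in megabytes by default:
# maximum job time of 2 hours 30 minutes (D-HH:MM)
#SBATCH --time=0-02:30
# maximum of 2 gigabytes of memory per CPU core
#SBATCH --mem-per-cpu=2000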
Running on your own
If you are following this guide outside of a training workshop, or want to adjust the script to use in your own research, you will need to remove the --reservation line, since there won’t be a reservation in place, and change --account to the project code of an approved project that you are a member of.
Let’s go ahead and submit this job with the sbatch
command.
[s.jane.doe@sl1 ~]$ sbatch batchjob.sh
sbatch will respond with the number of the job.
Submitted batch job 3739464
Our job should only take a couple of seconds to run, but if we are fast we might see it in the squeue
list.
[s.jane.doe@sl1 ~]$ squeue --user=s.jane.doe
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3739464 work hostname jane.doe R 0:01 1 scs0018
Once the job is completed two new files should be created, one called hostname.out.3739464 and one called hostname.err.3739464. The .out file is the output from the command we ran and the .err file contains any errors from that command. Your job files will have different names as they will contain the job ID you were allocated and not 3739464. Let’s go ahead and look at the .out file:
[s.jane.doe@sl1 ~]$ cat hostname.out.3739464
scs0018
If we check the .err
file it should be blank:
[s.jane.doe@sl1 ~]$ cat hostname.err.3739464
Overriding the sbatch options from the command line
As well as specifying options to sbatch in the batch file, they can be specified on the command line too.
Let’s edit our batch file to run the command /bin/sleep 70 before /bin/hostname; this will cause it to wait for 70 seconds before exiting. As our job has a one-minute limit this should fail and the hostname output will never happen.
[s.jane.doe@sl1 ~]$ nano batchjob.sh
Edit the script to have the command /bin/sleep 70
before the hostname
command.
#!/bin/bash --login
###
# job name
#SBATCH --job-name=hostname
# job stdout file
#SBATCH --output=hostname.out.%J
# job stderr file
#SBATCH --error=hostname.err.%J
# maximum job time in D-HH:MM
#SBATCH --time=0-00:01
# maximum memory of 10 megabytes
#SBATCH --mem-per-cpu=10
# run a single task, using a single CPU core
#SBATCH --ntasks=1
# specify our current project
# change this for your own work
#SBATCH --account=scwXXXX
# specify the reservation we have for the training workshop
# remove this for your own work
# replace XXXX and YY with the codes provided by your instructor
#SBATCH --reservation=scwXXXX_YY
# specify the development partition; this will give our job a maximum of 30 minutes to run
#SBATCH --partition=development
###
/bin/sleep 70
/bin/hostname
Now let’s resubmit the job.
[s.jane.doe@sl1 ~]$ sbatch batchjob.sh
Submitted batch job 3739465
After approximately one minute the job will disappear from the squeue
output, but this time the .out
file should be empty and the .err
file will contain an error message saying the job was cancelled:
[s.jane.doe@sl1 ~]$ cat hostname.err.3739465
slurmstepd: error: *** JOB 3739465 ON scs0018 CANCELLED AT 2017-12-06T16:45:38 DUE TO TIME LIMIT ***
[s.jane.doe@sl1 ~]$ cat hostname.out.3739465
Now let’s override the time limit by giving the parameter --time 0-0:2 to sbatch; this will set the time limit to two minutes and the job should complete.
[s.jane.doe@sl1 ~]$ sbatch --time 0-0:2 batchjob.sh
Submitted batch job 3739466
After approximately 70 seconds the job will disappear from the squeue list, and this time we should have nothing in the .err file and a hostname in the .out file.
[s.jane.doe@sl1 ~]$ cat hostname.err.3739466
[s.jane.doe@sl1 ~]$ cat hostname.out.3739466
scs0018
Cancelling jobs
The scancel command can be used to cancel a job after it’s submitted. Let’s go ahead and resubmit the job we just used.
[s.jane.doe@sl1 ~]$ sbatch batchjob.sh
Submitted batch job 3739467
Now (within 60 seconds) let’s cancel the job.
[s.jane.doe@sl1 ~]$ scancel 3739467
This will cancel the job; squeue will now show no record of it, and there won’t be a .out or .err file for it.
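If you have several jobs queued and want to cancel them all at once, scancel can also filter by user; for example (use your own username):
scancel --user=s.jane.doe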
Listing jobs that have run
The sacct command lists all the jobs you have run. By default this shows the job ID, the job name, partition, account, number of CPUs allocated, the state of the job and its exit code.
[s.jane.doe@sl1 ~]$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
21713 bash compute scw1000 1 COMPLETED 0:0
21713.extern extern scw1000 1 COMPLETED 0:0
21713.0 bash scw1000 1 COMPLETED 0:0
21714 bash compute scw1000 1 COMPLETED 0:0
21714.extern extern scw1000 1 COMPLETED 0:0
21714.0 bash scw1000 1 COMPLETED 0:0
21716 hostname compute scw1000 1 COMPLETED 0:0
21716.batch batch scw1000 1 COMPLETED 0:0
21716.extern extern scw1000 1 COMPLETED 0:0
In the output above the account is the project you are associated with. scw1000
is the RSE project. You’ll probably see a different project account here.
Using the sbatch command
If you haven’t done so already:
- Write a submission script to run the hostname command on one node, with one core, using one megabyte of RAM and a maximum run time of one minute. Have it save its output to hostname.out.%J and errors to hostname.err.%J.
- Run your script using sbatch.
- Examine the output file; which host did it run on?
- Try running it again; did your command run on the same host?
- Now add the command /bin/sleep 70 before the line running hostname in the script. Run the job again and examine the output of squeue as it runs. How many seconds does the job run for before it ends? Hint: the command watch -n 1 squeue will run squeue every second and show you the output. Press CTRL+C to stop it.
- What is in the .err file? Why did your script exit? Hint: if it wasn’t due to the time limit expiring, try altering another parameter so that it is.
Running multiple copies of a job with srun
So far we’ve only run a single copy of a program. Often we’ll need to run multiple copies of something. To do this we can combine the sbatch
and srun
commands. Instead of just placing the command at the end of the script we’ll srun
the command.
This will allow multiple copies of the command to run. In the example below two copies of the hostname command are run on two different nodes.
#!/bin/bash --login
###
# job name
#SBATCH --job-name=hostname
# job stdout file
#SBATCH --output=hostname.out.%J
# job stderr file
#SBATCH --error=hostname.err.%J
# maximum job time in D-HH:MM
#SBATCH --time=0-00:01
# maximum memory of 10 megabytes
#SBATCH --mem-per-cpu=10
# run two tasks
#SBATCH --ntasks=2
# run the tasks across two nodes; i.e. one per node
#SBATCH --nodes=2
# specify our current project
# change this for your own work
#SBATCH --account=scwXXXX
# specify the reservation we have for the training workshop
# remove this for your own work
# replace XXXX and YY with the codes provided by your instructor
#SBATCH --reservation=scwXXXX_YY
# specify the development partition; this will give our job a maximum of 30 minutes to run
#SBATCH --partition=development
###
srun /bin/hostname
Save this as batchjob_parallel.sh
and run it with sbatch
[s.jane.doe@sl1 ~]$ sbatch batchjob_parallel.sh
The output will now go into hostname.out.jobnumber
and should contain two different hostnames.
Job Arrays
Job Arrays are another method for running multiple copies of the same job. The --array
parameter to sbatch allows us to make use of this feature.
[s.jane.doe@sl1 ~]$ sbatch --array=0-2 batchjob.sh
The above command will submit three copies of the batchjob.sh command.
[s.jane.doe@sl1 ~]$ squeue --user=s.jane.doe
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3739590_0 compute hostname s.jane.doe R 0:01 1 scs0018
3739590_1 compute hostname s.jane.doe R 0:01 1 scs0018
3739590_2 compute hostname s.jane.doe R 0:01 1 scs0096
Running squeue as this is happening will show three distinct jobs, each with an _ followed by the array index on the end of its job ID. When the jobs are complete there will be three output and three error files, each named with its own job ID.
[s.jane.doe@sl1 ~]$ ls -rt | tail -6
hostname.out.3739592
hostname.out.3739591
hostname.out.3739590
hostname.err.3739590
hostname.err.3739592
hostname.err.3739591
It’s possible for programs to get hold of their array number from the $SLURM_ARRAY_TASK_ID environment variable. If we add the command echo $SLURM_ARRAY_TASK_ID to our batch script then it will be possible to see this in the output file.
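A common pattern is to use the array index to choose which piece of data each job processes. As a sketch (the input files and analysis program here are hypothetical), the end of a batch script might look like this:
# each array task processes its own input file, e.g. input_0.txt, input_1.txt, ...
INPUT_FILE=input_${SLURM_ARRAY_TASK_ID}.txt
# run a (hypothetical) analysis program on that file
./my_analysis ${INPUT_FILE} > result_${SLURM_ARRAY_TASK_ID}.txt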
Choosing the proper resources for your job
When you submit a job, you are requesting resources from the scheduler to run your job. These are:
- Time
- Memory
- Number of cores (CPUs)
- Number of nodes
- (sort of) the queue, or group of machines, to use
Choosing resources is like playing a game with the scheduler: you want to request enough to get your job completed without failure, but if you request too much, your job is “bigger” and thus harder to schedule, and if you request too little and your job goes over the request, it is killed. So you want to get it just right, and pad a little for wiggle room.
Another way to think of ‘reserving’ a compute node for your job is as making a reservation at a restaurant:
- if you bring more guests to your dinner, there won’t be room at the restaurant, but the wait staff may try to fit them in. If so, things will be more crowded and the kitchen (and the whole restaurant) may slow down dramatically
- if you bring fewer guests and don’t notify the staff in advance, the extra seats are wasted; no one else can take the empty places, and the restaurant may lose money.
“Never use a piece of software for the first time without looking to see what command-line options are available and what default parameters are being used” —Keith Bradnam, acgt.me
Time
This is determined by test runs that you do on your code during an interactive session, or, if you submit a batch job, by over-asking first, checking the amount of time actually needed, then reducing the time on later runs. The Supercomputing Wales hubs have a limit of three days maximum to run a job.
Please: due to scheduler overhead, bundle commands so that each job runs for a minimum of around 10 minutes.
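For example, rather than submitting hundreds of jobs that each run one very short command, you can bundle them into a single job. A sketch, where process_file is a hypothetical command:
# run many short tasks inside one job instead of one job each
for i in $(seq 1 500)
do
    ./process_file input_${i}.txt
done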
Memory
We recommend that you check the software docs for memory requirements. But if these are not stated, we can take another approach. On the Supercomputing Wales hubs, each job is allowed, on average, 9 GB of RAM per core allocated. So, try 3 GB and do a trial run via srun or sbatch. If your job was killed, look at your log files or sacct. If it shows a memory error, you went over; ask for more and try again.
Once the job has finished, ask the scheduler how much RAM was used by using the sacct
command to get post-run job info:
sacct -j JOBID --format=JobID,JobName,ReqMem,MaxRSS,Elapsed # RAM requested/used!!
The ReqMem
field is how much you asked for and MaxRSS
is how much was actually used. Now go back and adjust your RAM request in your sbatch command or submission script.
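As an illustration (the numbers here are made up), if MaxRSS showed that the job peaked at around 2.5 GB per core, you might request a little more than that on the next run rather than the full allowance, either in the script:
# request 3 gigabytes of memory per CPU core
#SBATCH --mem-per-cpu=3000
or on the command line with sbatch --mem-per-cpu=3000 batchjob.sh.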
Number of Cores
You can tell Slurm how many cores you expect your software to use with the -n
or --ntasks
arguments. Setting this to more than one doesn’t cause multiple copies of your job to run.
This is determined by your software, how anxious you are to get the work done, and how well your code scales. NOTE: throwing more cores at a job does not automatically make it run faster! This is a common mistake and will waste compute time and prevent other users from running jobs. Ensure your software can use multiple cores: inspect the parameters for your software and look for options such as ‘threads’, ‘processes’ or ‘cpus’; these often indicate that it has been parallelised. Then run test jobs to see how well it performs with multiple cores, inching up slowly from 1 to 2, 4, 8, etc., assessing the decrease in run time as you increase the number of cores. Programs often do not scale well – it’s important to understand this so you can choose the appropriate number.
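One simple way to run such a scaling test is to submit the same script several times with different core counts and compare the elapsed times afterwards with sacct. A sketch, assuming a hypothetical batch script called scaling_test.sh whose program picks up the core count from Slurm:
# submit the same job with 1, 2, 4 and 8 cores and compare run times later
for cores in 1 2 4 8
do
    sbatch --ntasks=${cores} --job-name=scale_${cores} scaling_test.sh
done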
Number of Nodes
For most software, this choice is simple: 1. There are very few software packages capable of running across multiple nodes. If they are capable, they will probably mention the use of a technology called MPI (for example OpenMPI). Please talk to your local Supercomputing Wales staff about how to run this. If you wish to set this, use the --nodes or -N options to sbatch.
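For an MPI-capable program, the relevant parts of a batch script might look like the following sketch (my_mpi_program is hypothetical; the rest of the script would keep the usual directives shown earlier):
# request 2 nodes with 40 tasks on each (80 MPI ranks in total)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
# launch one copy of the program per task
srun ./my_mpi_program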
Partitions (Queues)
Partitions, or queues, are groupings of computers used to run a certain profile of jobs. This profile could be defined by maximum run time, number of cores used, maximum amount of RAM, and so on. On the Supercomputing Wales hubs each unique configuration of systems has its own partition. Earlier on we used the sinfo command to list the state of the cluster; one of the columns it showed was the name of the partition.
Here is the output of sinfo
on Sunbird in Swansea.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 3-00:00:00 1 fail scs0042
compute* up 3-00:00:00 1 drain* scs0004
compute* up 3-00:00:00 2 mix scs[0018,0065]
compute* up 3-00:00:00 84 alloc scs[0001-0003,0005-0017,0020-0035,0043-0046,0049-0064,0066-0072,0097-0114,0116-0122]
compute* up 3-00:00:00 34 idle scs[0019,0036-0041,0047-0048,0073-0096,0115]
gpu up 2-00:00:00 4 idle scs[2001-2004]
The partition name is listed in the first column. The *
next to the
compute
partition denotes that it is the default. We can see that in
total it contains 122 nodes. Each of these has 384GB of RAM and 40
Intel Xeon Scalable Silver cores. Meanwhile the gpu
queue contains
4 nodes, which have the same number of CPU cores and amount of RAM
as those in the compute
queue, but additionally have two NVIDIA
Tesla V100 GPUs each.
We can specify which partition a job runs in with the -p
or --partition
arguments to sbatch
. So for example the following command will run our batch job on the compute partition.
sbatch --partition compute batchjob.sh
We could also add the following line to the batch submission script.
#SBATCH --partition compute
Email Alerts from Slurm
You can receive email alerts when your job begins and ends by adding the following to your Slurm script. Set --mail-type
to END
if you just want to be alerted about jobs completing.
#SBATCH --mail-user=abc1@aber.ac.uk
#SBATCH --mail-type=ALL
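For example, to be emailed only when the job finishes:
#SBATCH --mail-user=abc1@aber.ac.uk
#SBATCH --mail-type=END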
Exercises
Getting email output from sbatch
- Add the following lines to your script from the previous exercise (change the address to your own email address):
#SBATCH --mail-user=abc1@aber.ac.uk
#SBATCH --mail-type=ALL
- Submit the script with the sbatch command. You should get an email when the job starts and finishes.
Using job arrays
- Add the following to the end of your job script: echo $SLURM_ARRAY_TASK_ID
- Submit the script with the sbatch --array=0-1 command.
- When the job completes, look at the output file. What does the last line contain?
- Try resubmitting with different array numbers, for example 10-11. Be careful not to create too many jobs.
- What use is it for a job to know its array number? What might it do with that information?
- Try looking at the variables $SLURM_JOB_ID and $SLURM_ARRAY_JOB_ID. What do these contain?
Solution
- The last line of the output files should contain 0 and 1.
- If you set the array IDs to 10 and 11 then the output should contain 10 and 11.
- It is useful as a parameter for partitioning a dataset, so that each job processes a different part of it.
- The job ID and the parent ID of the array job. The parent ID is usually one less than the ID of the first array job.
More information about Slurm
- Supercomputing Wales Documentation
- Harvard University’s list of common Slurm commands
- For those coming from another cluster/scheduler, check out Slurm’s scheduler Rosetta stone
Key Points
Interactive jobs let you test out the behaviour of a command, but aren’t practical for running lots of jobs
Batch jobs are suited for submitting a job to run without user interaction.
Job arrays are useful to submit lots of jobs.
Slurm lets you set parameters about how many processors or nodes are allocated, how much memory can be used, and how long the job can run.
Slurm can email you when a job starts or finishes.