Optimising workflow performance

Last updated on 2025-10-29 | Edit this page

Overview

Questions

What compute resources are available on my system?
How do I define jobs with more than one thread?
How do I measure the compute resources being used by a workflow?
How do I run my workflow steps in parallel?

Objectives

Understand CPU, RAM and I/O bottlenecks
Understand the threads declaration
Use common Unix tools to look at resource usage

Processes, threads and processors

Some definitions:

Process: A running program (in our case, each Snakemake job can be considered one process)
Threads: Each process has one or more threads which run in parallel
Processor: Your computer has multiple CPU cores or processors, each of which can run one thread at a time

These definitions are a little simplified, but fine for our needs. The operating system kernel shares out threads among processors:

Having fewer threads than processors means you are not fully using all your CPU cores
Having more threads than processors means threads have to “timeslice” on a core which is generally suboptimal

If you tell Snakemake how many threads each rule will use, and how many cores you have available, it will start jobs in parallel to use all your cores. In the diagram below, five jobs are ready to run and there are four system cores.

Representation of a computer with four microchip icons indicating four available cores. To the right are five small green boxes representing Snakemake jobs and labelled as wanting 1, 1, 1, 2 and 8 threads respectively.

Listing the resources of your machine

So, to know how many threads to make available to Snakemake, we need to know how many CPU cores we have on our machine. On Linux, we can find out how many cores you have on a machine with the lscpu command.

$ lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                ARM
  Model name:             Cortex-A72
    Model:                3
    Thread(s) per core:   1

There we can see that we have four CPU cores, each of which can run a single thread.

On macOS meanwhile, we use the command sysctl -n hw.ncpu:

$ sysctl hw.ncpu
hw.ncpu: 8

In this case, we see that this Mac has eight cores.

Likewise find out the amount of RAM available:

BASH

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       1.1Gi       110Mi        97Mi       2.6Gi       2.6Gi
Swap:          199Mi       199Mi        60Ki

In this case, the machine has 3.7GiB of total RAM.

On macOS, the command is sysctl -h hw.memsize:

$ sysctl -h hw.memsize
hw.memsize: 34,359,738,368

This machine has around 34GB of RAM in total. Dividing by the number of bytes in 1GiB (\(1024^3\) bytes), that becomes 32GiB RAM.

We don’t want to use all of this RAM, but if we don’t mind other applications being unresponsive while our workflow runs, we can use the majority of it.

Finally, to check the available disk space, on the current partition:

BASH

$ df -h .

(or df -h without the . to show all partitions) This is the same on both macOS and Linux.

Parallel jobs in Snakemake

You may want to see the relevant part of the Snakemake documentation.

We’ll force all the intermediary steps to re-run by using the --forceall flag to Snakemake and time the whole run using the time command.

BASH

$ time snakemake --cores 1 --use-conda --forceall -- assets/plots/spectrum.pdf
real	3m10.713s
user	1m30.181s
sys	0m8.156s

Challenge

Measuring how concurrency affects execution time

What is the wallclock time reported by the above command? We’ll work out the average for everyone present, or if you are working through the material on your own, repeat the measurement three times to get your own average.

Now change the Snakemake concurrency option to --cores 2 and then --cores 4. Finally, try using every available core on your machine, using --cores all.

How does the total execution time change?
What factors do you think limit the power of this setting to reduce the execution time?

Show me the solution

The time will vary depending on the system configuration but somewhere around 150–200 seconds is expected, and this should reduce to around 75–100 secs with --cores 2 but depending on your computer, higher --cores might produce diminishing returns.

Things that may limit the effectiveness of parallel execution include:

The number of processors in the machine
The number of jobs in the DAG which are independent and can therefore be run in parallel
The existence of single long-running jobs
The amount of RAM in the machine
The speed at which data can be read from and written to disk

There are a few gotchas to bear in mind when using parallel execution:

Parallel jobs will use more RAM. If you run out then either your OS will swap data to disk, or a process will crash.
Parallel jobs may trip over each other if they try to write to the same filename at the same time (this can happen with temporary files).
The on-screen output from parallel jobs will be jumbled, so save any output to log files instead.

Multi-thread rules in Snakemake

In the diagram at the top, we showed jobs with 2 and 8 threads. These are defined by adding a threads: block to the rule definition. We could do this for the ps_mass rule:

# Compute pseudoscalar mass and amplitude, read plateau from metadata,
# and plot effective mass
rule ps_mass:
    input: "raw_data/beta{beta}/out_corr"
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot=multiext(
            "intermediary_data/beta{beta}/corr.ps_eff_mass",
            config["plot_filetype"],
        ),
    log:
        messages="intermediary_data/beta{beta}/corr.ps_mass.log",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    threads: 4
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {config[plot_styles]} 2>&1 | tee {log.messages}"

You should explicitly use threads: 4 rather than params: threads = "4" because Snakemake considers the number of threads when scheduling jobs. Also, if the number of threads requested for a rule is less than the number of available processors then Snakemake will use the lower number.

Snakemake uses the threads variable to set common environment variables like OMP_NUM_THREADS. If you need to pass the number explicitly to your program, you can use the {threads} placeholder to get it.

Callout

Fine-grained profiling

Rather than timing the entire workflow, we can ask Snakemake to benchmark an individual rule.

For example, to benchmark the ps_mass step we could add this to the rule definition:

rule ps_mass:
    benchmark:
        "benchmarks/ps_mass.beta{beta}.txt"
    ...

The dataset here is so small that the numbers are tiny, but for real data this can be very useful as it shows time, memory usage and IO load for all jobs.

Running jobs on a cluster

Learning about clusters is beyond the scope of this course, but can be essential for more complex workflows working with large amounts of data.

When working with Snakemake, there are two options to getting the workflow running on a cluster:

Similarly to most tools, we may install Snakemake on the cluster, write a job script, and execute Snakemake on our workflow inside a job.
We can teach Snakemake how to run jobs on the cluster, and run our workflow from our own computer, having Snakemake do the work of submitting and monitoring the jobs for us.

To run Snakemake in the second way, someone will need to determine the right parameters for your particular cluster and save them as a profile. Once this is working, you can share the profile with other users on the cluster, so discuss this with your cluster sysadmin.

Instructions for configuring the Slurm executor plugin can be found in the Snakemake plugin catalog, along with the drmaa, cluster-generic and cluster-sync plugins which can support PBS, SGE and other cluster schedulers.

A photo of some high performance computer hardware racked in five cabinets in a server room. Each cabinet is about 2.2 metres high and 0.8m wide. The doors of the cabinets are open to show the systems inside. Orange and yellow cabling is prominent, connecting ports within the second and third racks.

Callout

Cluster demo

A this point in the course there may be a cluster demo…

Key Points

To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
You also need to consider RAM, disk, and network bottlenecks
Profile your jobs to see what is taking most resources
Use --cores all to enable using all CPU cores
Snakemake is great for running workflows on compute clusters