Content from Running commands with Snakemake
Last updated on 2025-07-28 | Edit this page
Estimated time: 15 minutes
Overview
Questions
- How do I run a simple command with Snakemake?
Objectives
- Create a Snakemake recipe (a Snakefile)
- Use Snakemake to count the trajectories generated for an ensemble
Introduction
Data analysis in lattice quantum field theory generally has many moving parts: you will likely have many ensembles, with differing physical and algorithmic parameters, and for each many different observables may be computed. These need to be combined in different ways, making sure that compatible statistics are used. Making sure that each step runs in the correct order is non-trivial, requiring careful bookkeeping, especially if we want to update data as ensembles are extended, if we want to take advantage of parallelism to get results faster, and if we want auditability to be able to verify later what steps were performed.
While we could build up tools to do all of these things from scratch, these are not challenges exclusive to lattice, and so we can take advantage of others’ work rather than reinventing the wheel. This frees up our time to focus on the physics challenges. The category of “tools to help run complex arrangements of tools in the right order” is called “workflow management”; there are workflow managers available, most of which are specialised to a specific class of applications.
One workflow manager developed for scientific data analysis is called Snakemake; this will be the target of this lesson. Snakemake is similar to GNU Make, in that you create a text file containing rules specifying how input files are translated to output files, and then the software will work out what rules to run to generate a specified output from the available input files. Unlike Make, Snakemake uses a syntax closely based on Python, and files containing rules can be extended using standard Python syntax. It also has many quality-of-life improvements compared to Make, and so is much better suited for writing data analysis workflows.
At this point, you should have Snakemake already installed and available to you. To test this, we can open a terminal and run
$ snakemake --version
8.25.3
If you instead get a “command not found” error, go back to the setup and check that you have completed all the necessary steps.
Looking at the sample data
You should already have the sample data files unpacked. (If not, refer back to the lesson setup.) Under the su2pg/raw_data directory, you will find a series of subdirectories, each containing data for a single ensemble. In each are files containing the log of the configuration generation, the computation of the quenched meson spectrum, and the computation of the Wilson flow.
The sample data are for the SU(2) pure Yang-Mills theory, and have been generated using the HiRep code.
Each log contains header lines describing the setup, information on the computation being performed, and results for observables computed on each configuration. Code to parse these logs and compute statistics is included with the sample data; we’ll use this in due course.
Making a Snakefile
To start with, let’s define a rule to count the number of lines in one of the raw data files.
Within the su2pg/workflow directory, edit a new text file named Snakefile. Into it, insert the following content:
rule count_lines:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "wc -l raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"
Key points about this file
- The file is named Snakefile - with a capital S and no file extension.
- Some lines are indented. Indents must be with space characters, not tabs.
- The rule definition starts with the keyword rule followed by the rule name, then a colon.
- We named the rule count_lines. You may use letters, numbers or underscores, but the rule name must begin with a letter and may not be a keyword.
- The keywords input, output, shell are all followed by a colon.
- The file names and the shell command are all in "quotes".
The first line tells Snakemake we are defining a new rule. The subsequent indented lines form part of this rule; any unindented line that followed would not be included in the rule (there are none in this example). The input: line tells Snakemake what files to look for to be able to run this rule. If this file is missing (and there is no rule to create it), Snakemake will not consider running this rule. The output: line tells Snakemake what files to expect the rule to generate. If this file is not generated, then Snakemake will abort the workflow with an error. Finally, the shell: block tells Snakemake what shell commands to run to get the specified output from the given input.
Going back to the shell now, we can test this rule. From the su2pg directory, we can run the command
snakemake --jobs 1 --forceall --printshellcmds intermediary_data/beta2.0/pg.count
If we’ve made any transcription errors in the rule (missing quotes, bad indentations, etc.), then it will become clear at this point, as we’ll receive an error that we will need to fix.
For now, we will consistently run snakemake with the --jobs 1 --forceall --printshellcmds options. As we move through the lesson, we’ll explain in more detail when we need to modify them.
Let’s check that the output was correctly generated:
$ cat intermediary_data/beta2.0/pg.count
31064 raw_data/beta2.0/out_pg
Callout
You might have noticed that we are grouping files into directories like raw_data and intermediary_data. It is generally a good idea to keep raw input data separate from data generated by the analysis. This means that if you need to run a clean analysis starting from your input data, then it is much easier to know what to remove and what to keep. Ideally, the raw_data directory should be kept read-only, so that you don’t accidentally modify your input data. Similarly, it is a good idea to separate out “files that you want to include in a paper” from “intermediary files generated by the workflow but not needed in the paper”; we’ll talk more about that in a later section.
You might also worry that your tooling will need to use mkdir to create these directories; in fact, Snakemake will automatically create all directories where it expects to see output from rules that it runs.
In the first few episodes we always run Snakemake with the --forceall flag, and what this does is not explained until episode 4. The rationale is that the default Snakemake behaviour when pruning the DAG leads to learners seeing different output (typically the message “nothing to be done”) when repeating the exact same command. This can seem strange to learners who are used to scripting and imperative programming.
The internal rules used by Snakemake to determine which jobs in the DAG are to be run, and which skipped, are pretty complex, but the behaviour seen under --forceall is much simpler and more consistent; Snakemake simply runs every job in the DAG every time. You can think of --forceall as disabling the lazy evaluation feature of Snakemake, until we are ready to properly introduce and understand it.
Running Snakemake
Run snakemake --help | less to see the help for all available options. What does the --printshellcmds option in the snakemake command above do?
1. Protects existing output files
2. Prints the shell commands that are being run to the terminal
3. Tells Snakemake to only run one process at a time
4. Prompts the user for the correct input file
Hint: you can search in the text by pressing /, and quit back to the shell with q.
The answer is 2: it prints the shell commands that are being run to the terminal. This is such a useful thing we don’t know why it isn’t the default!
The --jobs 1 option is what tells Snakemake to only run one process at a time, and we’ll stick with this for now as it makes things simpler. The --forceall option tells Snakemake to always recreate output files, and we’ll learn about protected outputs much later in the course. Answer 4 is a total red herring, as Snakemake never prompts interactively for user input.
Counting trajectories
The count of output lines isn’t particularly useful. Potentially more interesting is the number of trajectories in a given file. In a HiRep generation log, each trajectory concludes with a line of the form
OUTPUT
[MAIN][0]Trajectory #1: generated in [39.717707 sec]
We can use grep to count the lines containing this marker.
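For example, we can check what this looks like directly in the shell, from the su2pg directory (the exact count will depend on the ensemble):
$ grep -c generated raw_data/beta2.0/out_pg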
Counting trajectories in Snakemake
Modify the Snakefile to count the number of trajectories in raw_data/beta2.0/out_pg, rather than the number of lines.
- Rename the rule to count_trajectories
- Keep the output file name the same
- Remember that the result needs to go into the output file, not just be printed on the screen
- Test the new rule once it is done
rule count_trajectories:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "grep -c generated raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"
Key Points
- Before running Snakemake you need to write a Snakefile
- A Snakefile is a text file which defines a list of rules
- Rules have inputs, outputs, and shell commands to be run
- You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule
Content from Running Python code with Snakemake
Last updated on 2025-07-28 | Edit this page
Estimated time: 15 minutes
Overview
Questions
- How do I configure an environment to run Python with Snakemake?
Objectives
- Create a Conda environment definition
- Use Snakemake to instantiate this environment and use it to run Python code
Why define an environment
Snakemake is written in Python, so anyone running a Snakemake workflow already has Python available. In principle, we could make use of this installation to run any Python code we need to run in our workflow. However, it’s more than likely we will need to make use of libraries that are not installed as part of the Snakemake installation. At that point, we would either need to install additional libraries into our Snakemake environment, which might be used by other analyses that need their own sets of libraries that could create conflicts, or to create a second Snakemake environment for this analysis. If different steps of our workflow need different, conflicting sets of libraries then this becomes more complicated again.
We would also like those trying to reproduce our work to be able to run using exactly the same software environment that we used in our original work. In principle, we could write detailed documentation specifying which packages to install; however, it is both more precise and more convenient to define the environment as a data file, which Conda can use to build the same environment every time.
Even better, we can tell Snakemake to use a specific Conda environment for each rule we define.
A basic environment definition
Conda environment definitions are created in YAML-format files. These specify what Conda packages are needed (including the target version of Python), as well as any Pip packages that are installed.
Callout
Some packages give you a choice as to whether to install using Conda or Pip. When working interactively with an environment, using Pip consistently typically reduces the chance of getting into a state where Conda is unable to install packages. That is less of a problem when constructing new environments from definition files, but even so, using Pip where possible will typically allow environments to resolve and install more quickly.
By convention, Conda environment definitions in Snakemake workflows are placed in a workflow/envs/ directory. Let’s create workflow/envs/analysis.yml now, and place the following content into it:
YAML
name: su2pg_analysis
channels:
  - conda-forge
dependencies:
  - pip=24.2
  - python=3.12.6
  - pip:
      - h5py==3.11.0
      - matplotlib==3.9.2
      - numpy==2.1.1
      - pandas==2.2.3
      - scipy==1.14.1
      - uncertainties==3.2.2
      - corrfitter==8.2
      - -e ../../libs/su2pg_analysis
This will install the specified versions of h5py, Matplotlib, Numpy, Pandas, Scipy, uncertainties, and corrfitter, as well as the analysis tools provided in the libs directory. The latter are installed in editable mode, so if you need to modify them to fix bugs or add functionality while working on your workflow, you don’t need to remember to manually reinstall them.
Using an environment definition in a Snakefile
Now that we have created an environment file, we can use it in our Snakefile to compute the average plaquette from a configuration generation log. Let’s add the following rule to workflow/Snakefile:
rule avg_plaquette:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.plaquette.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette raw_data/beta2.0/out_pg --output_file intermediary_data/beta2.0/pg.plaquette.json.gz"
The conda: block tells Snakemake where to find the Conda environment that should be used for running this rule. Let’s test this now:
snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/pg.plaquette.json.gz
We need to specify --use-conda to tell Snakemake to pay attention to the conda: specification.
Let’s check now that the output was correctly generated:
$ cat intermediary_data/beta2.0/pg.plaquette.json.gz | gunzip | head -n 20
{
"program": "pyerrors 2.13.0",
"version": "1.1",
"who": "ed",
"date": "2025-05-20 15:44:09 +0100",
"host": "tsukasa.lan, macOS-15.4-arm64-arm-64bit",
"description": {
"group_family": "SU",
"num_colors": 2,
"nt": 48,
"nx": 24,
"ny": 24,
"nz": 24,
"beta": 2.0,
"num_heatbath": 1,
"num_overrelaxed": 4,
"num_thermalization": 1000,
"thermalization_time": 2453.811479,
"num_trajectories": 10010
},
Some of the output will differ on your machine, since this library tracks provenance, such as where and when the code was run, in the output file.
More plaquettes
Add a second rule to compute the average plaquette in the file raw_data/beta2.2/out_pg. Add this to the same Snakefile you already made, under the avg_plaquette rule, and run your rules in the terminal. When running the snakemake command you’ll need to tell Snakemake to make both the output files.
You can choose whatever name you like for this second rule, but it can’t be avg_plaquette, as rule names need to be unique within a Snakefile. So in this example answer we use avg_plaquette2.
rule avg_plaquette2:
    input: "raw_data/beta2.2/out_pg"
    output: "intermediary_data/beta2.2/pg.plaquette.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette raw_data/beta2.2/out_pg --output_file intermediary_data/beta2.2/pg.plaquette.json.gz"
Then in the shell:
snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/pg.plaquette.json.gz intermediary_data/beta2.2/pg.plaquette.json.gz
If you think writing a separate rule for each output file is silly, you are correct. We’ll address this next!
Content from Placeholders and wildcards
Last updated on 2025-07-28 | Edit this page
Estimated time: 15 minutes
Overview
Questions
- How do I make a generic rule?
- How does Snakemake decide what rule to run?
Objectives
- Use Snakemake to compute the plaquette in any file
- Understand the basic steps Snakemake goes through when running a workflow
- See how Snakemake deals with some errors
Making rules more generic
In the previous two episodes, we wrote rules to count the number of generated trajectories in, and compute the average plaquette of, one ensemble. As a reminder, this was one such rule:
rule count_trajectories:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "grep -c generated raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"
When we needed to do the same for a second ensemble, we made a second copy of the rule, and changed the input and output filenames. This is obviously not scalable to large analyses: instead, we would like to write one rule for each type of operation we are interested in. To do this, we’ll need to use placeholders and wildcards. Such a generic rule might look as follows:
# Count number of generated trajectories for any ensemble
rule count_trajectories:
    input: "raw_data/{subdir}/out_pg"
    output: "intermediary_data/{subdir}/pg.count"
    shell:
        "grep -c generated {input} > {output}"
{subdir} here is an example of a wildcard. Wildcards are used in the input and output lines of the rule to represent parts of filenames. Much like the * pattern in the shell, the wildcard can stand in for any text in order to make up the desired filename. As with naming your rules, you may choose any name you like for your wildcards; so here we used subdir, since it is describing a subdirectory. If subdir is set to beta2.0 then the new generic rule will have the same inputs and outputs as the original rule. Using the same wildcards in the input and output is what tells Snakemake how to match input files to output files.
If two rules use a wildcard with the same name then Snakemake will treat them as different entities—rules in Snakemake are self-contained in this way.
Meanwhile, {input} and {output} are placeholders. Placeholders are used in the shell section of a rule, and Snakemake will replace them with appropriate values before running the command: {input} with the full name of the input file, and {output} with the full name of the output file.
If we had wanted to include the value of the subdir wildcard directly in the shell command, we could have used the placeholder {wildcards.subdir}, but in most cases, as here, we just need the {input} and {output} placeholders.
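For example, purely as an illustrative sketch (we don’t need this in our workflow), the shell block of count_trajectories could echo the wildcard value before doing the counting:
    shell:
        "echo Counting trajectories for {wildcards.subdir} && grep -c generated {input} > {output}"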
Let’s test this general rule now:
snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/pg.count
As previously, if you see errors at this point, there is likely a problem with your Snakefile; check that all the rules match the ones that have appeared here, and that there aren’t multiple rules with the same name.
General plaquette computation
Modify your Snakefile so that it can compute the average plaquette for any ensemble, not just the ones we wrote specific rules for in the previous episode.
Test this with some of the values of \(\beta\) present in the raw data.
The replacement rule should look like:
# Compute average plaquette for any ensemble from its generation log
rule avg_plaquette:
    input: "raw_data/{subdir}/out_pg"
    output: "intermediary_data/{subdir}/pg.plaquette.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette {input} --output_file {output}"
To test this, for example:
snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta1.8/pg.plaquette.json.gz
Choosing the right wildcards
Our rule puts the trajectory counts into output files named like pg.count. How would you have to change the count_trajectories rule definition if you wanted:
1. the output file for raw_data/beta1.8/out_hmc to be intermediary_data/beta1.8/hmc.count?
2. the output file for raw_data/beta1.8/mass_fun-0.63/out_hmc to be intermediary_data/beta1.8/mass_fun-0.63/hmc.count?
3. the output file for raw_data/beta1.8/mass_fun-0.63/out_hmc to be intermediary_data/hmc_b1.8_m-0.63.count (for raw_data/beta1.9/mass_fun-0.68/out_hmc to be intermediary_data/hmc_b1.9_m-0.68.count, etc.)?
4. the output file for raw_data/beta1.8/mass_fun-0.63/out_hmc to be intermediary_data/hmc_m-0.63.count (for raw_data/beta1.9/mass_fun-0.68/out_hmc to be intermediary_data/hmc_m-0.68.count, etc.)?
(Assume that both pure-gauge and HMC logs tag generated trajectories the same way. Note that input files for the latter data are not included in the sample data, so these will not work as-is.)
In both cases, there is no need to change the shell part of the rule at all.
    input: "raw_data/{subdir}/out_hmc"
    output: "intermediary_data/{subdir}/hmc.count"
This can be done by changing only the static parts of the input: and output: lines.
This in fact requires no change from the previous answer. The wildcard {subdir} can include /, so it can represent multiple levels of subdirectory.
input: "raw_data/beta{beta}/mass_fun{mass}/out_hmc"
output: "intermediary_data/hmc_b{beta}_m{mass}.count"
In this case, it was necessary to change the wildcards, because the
subdirectory name needs to be split to obtain the values of \(\beta\) and \(m_{\mathrm{fun}}\). The names chosen here
are {beta}
and {mass}
, but you could choose
any names, as long as they match between the input
and
output
parts.
This one isn’t possible, because Snakemake cannot determine which input file you want to count by matching wildcards on the file name intermediary_data/hmc_m-0.63.count. You could try a rule like this:
    input: "raw_data/beta1.8/mass_fun{mass}/out_hmc"
    output: "intermediary_data/hmc_m{mass}.count"
…but it only works because \(\beta\) is hard-coded into the input line, and the rule will only work on this specific sample, not other cases where other values of \(\beta\) may be wanted. In general, input and output filenames need to be carefully chosen so that Snakemake can match everything up and determine the right input from the output filename.
Filenames aren’t data
Notice that in some examples we can pull out the value of \(\beta\) from the name of the directory in which the file is located. However, ideally, we should avoid relying on this being correct. The name and location are useful for us to find the correct file, but we should try to ensure that the file contents also contain these data, and that we make use of those data in preference to the filename.
Snakemake order of operations
We’re only just getting started with some simple rules, but it’s worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases:
- Prepares to run:
    - Reads in all the rule definitions from the Snakefile
- Plans what to do:
    - Sees what file(s) you are asking it to make
    - Looks for a matching rule by looking at the outputs of all the rules it knows
    - Fills in the wildcards to work out the input for this rule
    - Checks that this input file is actually available
- Runs the steps:
    - Creates the directory for the output file, if needed
    - Removes the old output file if it is already there
    - Only then, runs the shell command with the placeholders replaced
    - Checks that the command ran without errors and made the new output file as expected
For example, if we now ask Snakemake to generate a file named intermediary_data/wibble_1/pg.count:
OUTPUT
$ snakemake --jobs 1 --forceall --printshellcmds intermediary_data/wibble_1/pg.count
Building DAG of jobs...
MissingInputException in line 1 of /home/zenmaster/data/su2pg/workflow/Snakefile:
Missing input files for rule count_trajectories:
output: intermediary_data/wibble_1/pg.count
wildcards: subdir=wibble_1
affected files:
raw_data/wibble_1/out_pg
Snakemake sees that a file with a name like this could be produced by the count_trajectories rule. However, when it performs the wildcard substitution it sees that the input file would need to be named raw_data/wibble_1/out_pg, and there is no such file available. Therefore Snakemake stops and gives an error before any shell commands are run.
Dry-run (--dry-run) mode
It’s often useful to run just the first two phases, so that Snakemake will plan out the jobs to run, and print them to the screen, but never actually run them. This is done with the --dry-run flag, e.g.:
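(The target here is just an illustration; any target we have a rule for will do.)
snakemake --dry-run --jobs 1 --forceall --printshellcmds intermediary_data/beta2.0/pg.count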
We’ll make use of this later in the lesson.
The amount of checking may seem pedantic right now, but as the workflow gains more steps this will become very useful to us indeed.
Key Points
- Snakemake rules are made generic with placeholders and wildcards
- Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
- Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
- Snakemake checks for various error conditions and will stop if it sees a problem
Content from Chaining rules
Last updated on 2025-07-28 | Edit this page
Estimated time: 50 minutes
Overview
Questions
- How do I combine rules into a workflow?
- How can I make a rule with multiple input files?
- How should I refer to multiple files with similar names?
Objectives
- Use Snakemake to combine per-ensemble plaquette results into a single plot
- Understand how rules are linked by filename patterns
- Be able to use multiple input files in one rule
A pipeline of multiple rules
We have so far been able to count the number of generated
trajectories, and compute the average plaquette, given an output log
from the configuration generation. However, an individual average
plaquette is not interesting in isolation; what is more interesting is
how it varies between different values of the input parameters. To do
this, we will need to take the output of the avg_plaquette
rule that we defined earlier, and use it as input for another rule.
Let’s define that rule now:
# Take individual data files for average plaquette and plot combined results
rule plot_avg_plaquette:
    input:
        "intermediary_data/beta1.8/pg.plaquette.json.gz",
        "intermediary_data/beta2.0/pg.plaquette.json.gz",
        "intermediary_data/beta2.2/pg.plaquette.json.gz",
    output:
        "assets/plots/plaquette_scan.pdf"
    conda: "envs/analysis.yml"
    shell:
        "python src/plot_plaquette.py {input} --output_filename {output}"
Callout
You can see that here we’re putting “files that we want to include in a paper” in an assets directory, similarly to the raw_data and intermediary_data directories we discussed in a previous episode. It can be useful to further distinguish plots, tables, and other definitions, by using subdirectories in this directory.
Rather than one input, as we have seen in rules so far, this rule requires three. When Snakemake substitutes these into the {input} placeholder, it will automatically add a space between them. Let’s test this now:
snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf
Look at the logging messages that Snakemake prints in the terminal. What has happened here?
- Snakemake looks for a rule to make assets/plots/plaquette_scan.pdf
- It determines that the plot_avg_plaquette rule can do this, if it has intermediary_data/beta1.8/pg.plaquette.json.gz, intermediary_data/beta2.0/pg.plaquette.json.gz, and intermediary_data/beta2.2/pg.plaquette.json.gz
- Snakemake looks for a rule to make intermediary_data/beta1.8/pg.plaquette.json.gz
- It determines that avg_plaquette can make this if subdir=beta1.8
- It sees that the input needed is therefore raw_data/beta1.8/out_pg
- Now that Snakemake has reached an available input file, it runs the avg_plaquette rule
- It then looks through the other two \(\beta\) values in turn, repeating the process until it has all of the needed inputs
- Finally, it runs the plot_avg_plaquette rule
Here’s a visual representation of this process:
This, in a nutshell, is how we build workflows in Snakemake.
- Define rules for all the processing steps
- Choose input and output naming patterns that allow Snakemake to link the rules
- Tell Snakemake to generate the final output files
If you are used to writing regular scripts this takes a little getting used to. Rather than listing steps in order of execution, you are always working backwards from the final desired result. The order of operations is determined by applying the pattern matching rules to the filenames, not by the order of the rules in the Snakefile.
Choosing file name patterns
Chaining rules in Snakemake is a matter of choosing filename patterns that connect the rules. There’s something of an art to it, and most times there are several options that will work, but in all cases the file names you choose will need to be consistent and unambiguous.
Making file lists easier
In the rule above, we plotted the average plaquette for three values of \(\beta\) by listing the files expected to contain their values. In fact, we have data for a larger number of \(\beta\) values, but typing out each file by hand would be quite cumbersome. We can make use of the expand() function to do this more neatly:
    input:
        expand(
            "intermediary_data/beta{beta}/pg.plaquette.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        )
The first argument to expand() here is a template for the filename, and subsequent keyword arguments are lists of values to fill into the placeholders. The output is the Cartesian product of all the parameter lists.
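For example, a toy sketch with made-up filenames and a hypothetical channel parameter, just to show how the combinations are generated:
expand("plots/beta{beta}_{channel}.pdf", beta=[2.0, 2.2], channel=["ps", "v"])
# four filenames result, one per combination of beta and channel:
# plots/beta2.0_ps.pdf, plots/beta2.0_v.pdf, plots/beta2.2_ps.pdf, plots/beta2.2_v.pdf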
We can check that this works correctly:
snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf
Tabulating trajectory counts
The script src/tabulate_counts.py will take a list of files containing trajectory counts, and output a LaTeX table of them. Write a rule to generate this table for all values of \(\beta\), and output it to assets/tables/counts.tex.
The rule should look like:
# Output a LaTeX table of trajectory counts
rule tabulate_counts:
    input:
        expand(
            "intermediary_data/beta{beta}/pg.count",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        )
    output: "assets/tables/counts.tex"
    conda: "envs/analysis.yml"
    shell:
        "python src/tabulate_counts.py {input} > {output}"
To test this, for example:
snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/tables/counts.tex
Tabulating trajectory counts (continued)
This setup currently requires reading the value of \(\beta\) from the filename. Why is this not ideal? How would the workflow need to be changed to avoid this?
It’s easy for files to be misnamed when creating or copying them. Putting the wrong data into the file is harder, especially when it’s a raw data file generated by the same program as the rest of the data. (If the wrong value were given as input, this could happen, but the corresponding output data would also be generated at that incorrect value. Provided the values are treated consistently, the downstream analysis could in fact still be valid, just not exactly as intended.)
Currently, grep -c is used to count the number of trajectories. This would need to be replaced or supplemented with a tool that reads out the value of \(\beta\) from the input log, and outputs it along with the trajectory count. The src/tabulate_counts.py script could then be updated to use this number, rather than the filename.
In fact, the plaquette module does just this; in addition to the average plaquette, it also records the number of trajectories generated as part of the metadata and provenance information it tracks.
Key Points
- Snakemake links up rules by iteratively looking for rules that make missing inputs
- Careful choice of filenames allows this to work
- Rules may have multiple named input files (and output files)
- Use expand() to generate lists of filenames from a template
Content from Metadata and parameters
Last updated on 2025-07-28 | Edit this page
Estimated time: 15 minutes
Global parameters
Thus far, each of our rules has taken one or more input files, and given output files solely based on that. However, in some cases we may want to control options without having them within an input file.
For example, in the previous episode, we wrote a rule to plot a graph using the script src/plot_plaquette.py. The style of output we got was good for a paper, but if we were producing a poster, or putting the plot onto a slide with a dark background, we may wish to use a different output style. The plot_plaquette.py script accepts a --plot_styles argument, to tell it what style file to use to plot. One way to make use of this would be to add --plot_styles styles/jhep.mplstyle directly to the shell: block. However, if we had many such rules, and wanted to switch from generating output for a paper to generating it for a poster, then we would need to change the value in many places.
Instead, we can define a variable at the top of the Snakefile
plot_styles = "styles/jhep.mplstyle"
Then, when we use a script to generate a plot, we can update the shell: block of the corresponding rule similarly to
"python src/plot_plaquette.py {input} --output_filename {output} --plot_styles {plot_styles}"
Snakemake will substitute the value of the global plot_styles variable in place of the {plot_styles} placeholder.
We can test this by changing the style file to a different one (for example, a poster style), and running
snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf
We can see that the generated file now uses a different set of fonts.
Wilson flow
The tool su2pg_analysis.w0 computes the scale \(w_0\) given a log of the energy density during the evolution of the Wilson flow for an ensemble. To do this, the reference scale \(\mathcal{W}_0\) needs to be passed to the --W0 parameter. Use this, and the logs stored in the files out_wflow in each ensemble’s raw data directory, to output the \(w_0\) scale in a file wflow.w0.json.gz for each ensemble, taking the reference value \(\mathcal{W}_0 = 0.2\).
W0_reference = 0.2

# Compute w0 scale for single ensemble for fixed reference scale
rule w0:
    input: "raw_data/{subdir}/out_wflow"
    output: "intermediary_data/{subdir}/wflow.w0.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.w0 {input} --W0 {W0_reference} --output_file {output}"
Generating different filetypes
In addition to different plot styles, we may also wish to generate different filetypes. PDF is useful for including in LaTeX, but SVG may be a better format to use with some tools.
If we add a global definition:
plot_filetype = "pdf"
and update the output: block of the rule as:
    output:
        "assets/plots/plaquette_scan.{plot_filetype}"
does this have the same effect as the example with --plot_styles above?
(Hint: what happens when you try to make the targets assets/plots/plaquette_scan.svg and assets/plots/plaquette_scan.txt by specifying them at the command line, without changing the value of plot_filetype?)
This can achieve a similar result, but in a slightly different way. In the --plot_styles example, the {plot_styles} string is in the shell: block, and so directly looks up the plot_styles variable. (Recall that to look up a wildcard there, we needed to explicitly use the wildcards. prefix, as in {wildcards.subdir}.)
However, in this case the {plot_filetype} string is in the output: block, so it defines a wildcard. This may take any value, so if we instruct snakemake to produce plaquette_scan.txt, it will diligently pass that filename to plot_plaquette.py.
The plot_filetype = "pdf" definition is in fact ignored. It could however be used to set a default set of targets to generate, which we will talk about in a later episode.
Metadata from a file
We would frequently like our rules to depend on data that are specific to the specific ensembles being analysed. For example, consider the rule:
# Compute pseudoscalar mass and amplitude with fixed plateau
rule ps_mass:
    input: "raw_data/{subdir}/out_corr"
    output: "intermediary_data/{subdir}/corr.ps_mass.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output} --plateau_start 18 --plateau_end 23"
This rule hardcodes the positions of the start and end of the plateau region. In most studies, each ensemble and observable may have a different plateau position, so there is no good value to hardcode this to. Instead, we’d like a way of picking the right value from some list of parameters that we specify.
We could do this within the Snakefile, but where possible it is good to avoid mixing data with code. We shouldn’t need to modify our code every time we add or modify the data it is analysing. Instead, we’d like to have a dedicated file containing these parameters, and to be able to have Snakemake read it and pick out the correct values.
To do this, we can exploit the fact that Snakemake is an extension of Python. In particular, Snakemake makes use of the Pandas library for tabular data, which we can use to read in a CSV file. Let’s add the following to the top of the file:
import pandas
metadata = pandas.read_csv("metadata/ensemble_metadata.csv")
The file being read here is a CSV (Comma Separated Values) file. We can create, view, and modify this with the spreadsheet tool of our choice. Let’s take a look at the file now.

[Figure: the contents of metadata/ensemble_metadata.csv, viewed in a spreadsheet tool]
You can see that we have columns defining metadata to identify each ensemble, and columns for parameters relating to the analysis of each ensemble.
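As a rough sketch of the layout (only a single illustrative row is shown, reusing the plateau values hardcoded above; the real file has one row per ensemble and may contain further columns):
beta,ps_plateau_start,ps_plateau_end
2.0,18,23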
Now, how do we tell Snakemake to pull out the correct value from this?
# Compute pseudoscalar mass and amplitude, read plateau from metadata
rule ps_mass:
    input: "raw_data/beta{beta}/out_corr"
    output: "intermediary_data/beta{beta}/corr.ps_mass.json.gz"
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end}"
We’ve done a couple of things here. Firstly, we’ve made explicit the reference to \(\beta\) in the file paths, so that we can use beta as a wildcard, similarly to in the challenge in the previous episode. Secondly, we’ve introduced a params: block. This is how we tell Snakemake about quantities that may vary from run to run, but that are not filenames. Thirdly, we’ve used the lookup() function to search the metadata dataframe for the ensemble that we are considering. Finally, we’ve used the {params.plateau_start} and {params.plateau_end} placeholders to use these parameters in the shell command that gets run.
Let’s test this now:
snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/corr.ps_mass.json.gz
cat intermediary_data/beta2.0/corr.ps_mass.json.gz | gunzip | head -n 20
OUTPUT
TODO
TODO CHALLENGE
Content from Multiple inputs and outputs
Last updated on 2025-07-28 | Edit this page
Estimated time: 15 minutes
Multiple outputs
Quite frequently, we will want a rule to be able to generate more than one file. It’s important we let Snakemake know about this, both so that it can instruct our tools on where to place these files, and so it can verify that they are correctly created by the rule. For example, when fitting a correlation function with a plateau region that we specify, it’s important to look at an effective mass plot to verify that the plateau actually matches what we assert. The rule we just wrote doesn’t do this—it only spits out a numerical answer. Let’s update this rule so that it can also generate the effective mass plot.
# Compute pseudoscalar mass and amplitude, read plateau from metadata,
# and plot effective mass
rule ps_mass:
    input: "raw_data/beta{beta}/out_corr"
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot="intermediary_data/beta{beta}/corr.ps_eff_mass.pdf",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles}"
Rather than having a single string after output:, we now have a block with two lines. Each line has the format name=value, and is followed by a comma. To make use of these variables in our rule, we follow output by a ., and then the name of the variable we want to use, similarly to what we do for wildcards and params.
Non-specificity
What happens if we define multiple named output: variables like this, but refer to the {output} placeholder in the shell: block without specifying a variable name?
(One way to find this out is to try echo {output} as the entire shell: content; this will generate a missing output error, but will first let you see what the output is.)
Snakemake will provide all of the defined output variables, as a space-separated list. This is similar to what happens when an input is a list of files, as we saw earlier when looking at the expand() function.
Flow plots
Update the Wilson flow \(w_0\) computation that we looked at in a previous challenge to also output the flow of \(\mathcal{W}(t)\), so that the shape of the flow may be checked.
rule w0:
    input: "raw_data/{subdir}/out_wflow"
    output:
        data="intermediary_data/{subdir}/wflow.w0.json.gz",
        plot="intermediary_data/{subdir}/wflow.W_flow.pdf",
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.w0 {input} --W0 {W0_reference} --output_file {output.data} --plot_file {output.plot} --plot_styles {plot_styles}"
Multiple inputs
Similarly to outputs, there are many situations where we want to work
with more than one class of input file—for example, to combine differing
observables into one. For example, the meson_mass
rule we
wrote previously also outputs the amplitude of the exponential. When
combined with the average plaquette via one-loop matching, this can be
used to give an estimate of the decay constant. The syntax for this is
the same as we saw above for output:
.
rule one_loop_matching:
    input:
        plaquette="intermediary_data/{subdir}/pg.plaquette.json.gz",
        meson="intermediary_data/{subdir}/corr.{channel}_mass.json.gz",
    output:
        data="intermediary_data/{subdir}/pg.corr.{channel}_decay_const.json.gz",
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.one_loop_matching --plaquette {input.plaquette} --meson {input.meson} --output_file {output.data}"
Naming things
Even when there is only one output: file, we are still allowed to name it. This makes life easier if we need to add more outputs later, and can make it a little clearer what our intent is when we come to read the workflow later.
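For instance, the earlier avg_plaquette rule could equally have been written with its single output named (a minimal sketch):
rule avg_plaquette:
    input: "raw_data/{subdir}/out_pg"
    output:
        data="intermediary_data/{subdir}/pg.plaquette.json.gz",
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette {input} --output_file {output.data}"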
Spectrum plot
Write a rule that plots the pseudoscalar channel’s decay constant against its mass, for each ensemble studied. The tool src/plot_spectrum.py will help with this.
Try making the filename of the tool an input too, so that if the script is changed, Snakemake will correctly re-run the workflow.
rule spectrum:
    input:
        script="src/plot_spectrum.py",
        ps_mass="intermediary_data/{subdir}/corr.{channel}_mass.json.gz",
        ps_decay_const="intermediary_data/{subdir}/pg.corr.{channel}_decay_const.json.gz",
    output:
        plot="assets/plots/spectrum.{plot_filetype}"
    conda: "envs/analysis.yml"
    shell:
        "python {input.script} {input.ps_mass} {input.ps_decay_const} --output_file {output.plot} --plot_styles {plot_styles}"
Log files
When a process run by Snakemake exits with an error code, Snakemake removes all the expected output files. Usually this is what we want: we don’t want to have potentially corrupt output, that might be used as input for subsequent rules. However, there are some classes of output file that are useful in helping to identify what caused the error in the first place: log files.
We can tell Snakemake that specified files are log files, rather than regular output files, by placing them in a log: block rather than an output: one. Snakemake will not delete a file marked as a log if an error is raised by the process generating it.
For example, for the ps_mass rule above, we might use:
# Compute pseudoscalar mass and amplitude, read plateau from metadata,
# and plot effective mass
rule ps_mass:
    input:
        data="raw_data/beta{beta}/out_corr",
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot="intermediary_data/beta{beta}/corr.ps_eff_mass.pdf",
    log:
        messages="intermediary_data/beta{beta}/corr.ps_mass.log",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input.data} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"
|& tee
You may recall that | is the pipe operator in the Unix shell, taking standard output from one program and passing it to standard input of the next. (If this is unfamiliar, you may wish to look through the Software Carpentry introduction to the Unix shell when you have a moment.)
Adding the & symbol means that both the standard output and standard error streams are piped, rather than only standard output. This is useful for a log file, since we will typically want to see errors there.
The tee command “splits a pipe”; that is, it takes standard input and outputs it both to standard output and to the specified filename. This way, we get the log on disk, but also output to the screen as well, so we can monitor issues as the workflow runs.
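As a quick illustration outside Snakemake (the file and directory names here are arbitrary), you can try this directly in the shell:
$ ls raw_data no_such_directory |& tee listing.log
Both the listing of raw_data (standard output) and the error about no_such_directory (standard error) appear on the screen, and both are also saved to listing.log.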
Logged plots
Adjust the solution for plotting the spectrum above so that any warnings or errors generated by the plotting script are logged to a file.
rule spectrum:
    input:
        script="src/plot_spectrum.py",
        ps_mass="intermediary_data/{subdir}/corr.{channel}_mass.json.gz",
        ps_decay_const="intermediary_data/{subdir}/pg.corr.{channel}_decay_const.json.gz",
    output:
        plot="assets/plots/spectrum.{plot_filetype}"
    log:
        messages="intermediary_data/{subdir}/pg.corr.{channel}_mass.log"
    conda: "envs/analysis.yml"
    shell:
        "python {input.script} {input.ps_mass} {input.ps_decay_const} --output_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"
Dealing with errors
We’ll end the chapter by looking at a common problem that can arise if you mistype a filename in a rule. It may seem silly to break the workflow when we just got it working, but it will be instructive, so let’s modify the Snakefile and deliberately specify an incorrect output filename in the ps_mass rule.
...
    shell:
        "python -m su2pg_analysis.meson_mass {input.data} --output_file {output.data}.json_ --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"
To keep things tidy, this time we’ll manually remove the intermediary data directory.
And re-run.
$ snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/corr.ps_mass.json.gz
...
TODO
There’s a lot to take in here. Some of the messages are very informative. Some less so.
- Snakemake did actually run the tool, as evidenced by the output from the program that we see on the screen.
- Python is reporting that there is a file missing.
- Snakemake complains that one expected output file is missing: intermediary_data/beta2.0/corr.ps_mass.json.gz.
- The other expected output file, intermediary_data/beta2.0/corr.ps_eff_mass.pdf, was found but has now been removed by Snakemake.
- Snakemake suggests this might be due to “filesystem latency”.
This last point is a red herring. “Filesystem latency” is not an issue here, and never will be, since we are not using a network filesystem. We know what the problem is, as we deliberately caused it, but to diagnose an unexpected error like this we would investigate further by looking at the intermediary_data/beta2.0 subdirectory.
$ ls intermediary_data/beta2.0/
TODO
Remember that Snakemake itself does not create any output files. It just runs the commands you put in the shell sections, then checks to see if all the expected output files have appeared.
So if the file names created by your rule are not exactly the same as in the output: block you will get this error, and you will, in this case, find that the misnamed file is present, while the file that was named correctly (corr.ps_eff_mass.pdf) has been cleaned up by Snakemake.
Errors are normal
Don’t be disheartened if you see errors like the one above when first testing your new Snakemake workflows. There is a lot that can go wrong when writing a new workflow, and you’ll normally need several iterations to get things just right. One advantage of the Snakemake approach compared to regular scripts is that Snakemake fails fast when there is a problem, rather than ploughing on and potentially running junk calculations on partial or corrupted data. Another advantage is that when a step fails we can safely resume from where we left off, as we’ll see in the next episode.
Finally, edit the names in the Snakefile back to the correct version and re-run to confirm that all is well.
snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/spectrum.pdf
Content from How Snakemake plans jobs
Last updated on 2025-07-28 | Edit this page
Estimated time: 15 minutes
Overview
Questions
- How do I visualise a Snakemake workflow?
- How does Snakemake avoid unnecessary work?
- How do I control what steps will be run?
Objectives
- View the DAG for our pipeline
- Understand the logic Snakemake uses when running and re-running jobs
The DAG
You may have noticed that one of the messages Snakemake always prints is:
OUTPUT
Building DAG of jobs...
A DAG is a Directed Acyclic Graph and it can be pictured like so:
The above DAG is based on three of our existing rules, and shows all the jobs Snakemake would run to compute the pseudoscalar decay constant of the \(\beta = 4.0\) ensemble.
Note that:
- A rule can appear more than once, with different wildcards (a rule plus wildcard values defines a job)
- A rule may not be used at all, if it is not required for the target outputs
- The arrows show dependency ordering between jobs
- Snakemake can run the jobs in any order that doesn’t break dependency. For example, one_loop_matching cannot run until both ps_mass and avg_plaquette have completed, but it may run before or after count_trajectories
- This is a work list, not a flowchart, so there are no if/else decisions or loops. Snakemake runs every job in the DAG exactly once
- The DAG depends both on the Snakefile and on the requested target outputs, and the files already present
- When building the DAG, Snakemake does not look at the shell part of the rules at all. Only when running the DAG will Snakemake check that the shell commands are working and producing the expected output files
How many jobs?
If we asked Snakemake to run one_loop_matching on all twelve ensembles (beta2.0 to beta10.0), how many jobs would that be in total?
36 in total:
- 12 \(\times\) one_loop_matching
- 12 \(\times\) ps_mass
- 12 \(\times\) avg_plaquette
- 0 \(\times\) count_trajectories
Snakemake is lazy, and laziness is good
For the last few episodes, we’ve told you to run Snakemake like this:
snakemake --jobs 1 --forceall --printshellcmds --use-conda
As a reminder, the --jobs 1 flag tells Snakemake to run one job at a time, --printshellcmds is to print out the shell commands before running them, and --use-conda is to ensure that Snakemake sets up the correct Conda environment.
The --forceall flag turns on forceall mode, and in normal usage you don’t want this.
At the end of the last chapter, we generated a spectrum plot by running:
snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/spectrum.pdf
Now try without the --forceall option. Assuming that the output files are already created, you’ll see this:
$ snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/spectrum.pdf
TODO
In normal operation, Snakemake only runs a job if:
- A target file you explicitly requested to make is missing,
- An intermediate file is missing and it is needed in the process of making a target file,
- Snakemake can see an input file which is newer than an output file, or
- A rule definition or configuration has changed since the output file was created.
The last of these relies on a ledger that Snakemake saves into the .snakemake directory.
Let’s demonstrate each of these in turn, by altering some files and re-running Snakemake without the --forceall option.
BASH
$ rm assets/plots/spectrum.pdf
$ snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf
This just re-runs spectrum, the final step.
BASH
$ rm intermediary_data/beta*/corr.ps_mass.json.gz
$ snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf
“Nothing to be done”. Some intermediate output is missing, but Snakemake already has the file you are telling it to make, so it doesn’t worry.
BASH
touch raw_data/beta*/out_pg
snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf
The touch command is a standard Unix command that resets the timestamp of the file, so now the generation logs look to Snakemake as if they were just modified.
Snakemake sees that some of the input files used in the process of producing assets/plots/spectrum.pdf are newer than the existing output file, so it needs to run the avg_plaquette and one_loop_matching steps again. Of course, the one_loop_matching step needs the pseudoscalar mass data that we deleted earlier, so now the correlation function fitting step is re-run also.
Explicitly telling Snakemake what to re-run
The default timestamp-based logic is really useful when you want to:
- Change or add some inputs to an existing analysis without re-processing everything
- Continue running a workflow that failed part-way
In most cases you can also rely on Snakemake to detect when you have edited a rule, but sometimes you need to be explicit, for example if you have updated an external script or changed a setting that Snakemake doesn’t see.
The --forcerun flag allows you to explicitly tell Snakemake that a rule has changed and that all outputs from that rule need to be re-evaluated.
snakemake --forcerun spectrum --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf
Note on --forcerun
Due to a quirk of the way Snakemake parses command-line options, you need to make sure there are options after the --forcerun ..., before the list of target outputs. If you don’t do this, Snakemake will think that the target files are instead items to add to the --forcerun list, and then when building the DAG it will just try to run the default rule.
The easiest way is to put the --jobs flag before the target outputs. Then you can list multiple rules to re-run, and also multiple targets, and Snakemake can tell which is which.
BASH
snakemake --forcerun avg_plaquette ps_mass --jobs 1 --printshellcmds --use-conda intermediary_data/beta2.0/pg.corr.ps_decay_const.json.gz intermediary_data/beta2.5/pg.corr.ps_decay_const.json.gz
The reason for using the --jobs flag specifically is that you pretty much always want this option.
The --force flag specifies that the target outputs named on the command line should always be regenerated, so you can use this to explicitly re-make specific files.
This always re-runs spectrum, regardless of whether the output file is there already. For all intermediate outputs, Snakemake applies the default timestamp-based logic. Contrast with --forceall, which runs the entire DAG every time.
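To summarise, a sketch of the three flags side by side (the target here is just an example):
BASH
# re-make this specific file; everything upstream follows the normal timestamp-based logic
snakemake --force --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf
# re-run all jobs of the named rule(s); anything downstream of them then also becomes out of date
snakemake --forcerun spectrum --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf
# re-run every job in the DAG from scratch
snakemake --forceall --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf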
Visualising the DAG
Snakemake can draw a picture of the DAG for you, if you run it like this:
snakemake --force --dag assets/plots/spectrum.pdf | gm display -
Using the --dag option implicitly activates the --dry-run option, so that Snakemake will not actually run any jobs; it will just print the DAG and stop. Snakemake prints the DAG in a text format, so we use the gm command to make this into a picture and show it on the screen.
Note on gm display
The gm command is provided by the GraphicsMagick toolkit. On systems where gm will not display an image directly, you can instead save the DAG to a PNG file. You will need the dot program from the GraphViz package installed.
snakemake --force --dag assets/plots/spectrum.pdf | dot -Tpng > dag.png
[Figure: the DAG for the spectrum plot; jobs that are not to be run are shown as dotted boxes]
The boxes drawn with dotted lines indicate steps that are not to be run, as the output files are already present and newer than the input files.
Visualising the effect of the --forcerun and --force flags
Run one_loop_matching on the beta2.0 ensemble, and then use the --dag option as shown above to check:
1. How many jobs will run if you ask again to create this output with no --force, --forcerun or --forceall options?
2. How many if you use the --force option?
3. How many if you use the --forcerun ps_mass option?
4. How many if you edit the metadata file so that the ps_plateau_start for the \(\beta=2.0\) ensemble is TODO, rather than TODO?
This is a way to make the result in the first place:
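For example (assuming the pseudoscalar channel):
snakemake --jobs 1 --printshellcmds --use-conda intermediary_data/beta2.0/pg.corr.ps_decay_const.json.gz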
1. This command should show three boxes, but all are dotted, so no jobs are actually to be run.
2. The --force flag re-runs only the job to create the output file, so in this case one box is solid, and only that job will run.
3. With --forcerun ps_mass, the ps_mass job will re-run, and Snakemake sees that this also requires re-running one_loop_matching, so the answer is 2.
If you see a message like the one below, it’s because you need to put an option after ps_mass, or else Snakemake gets confused about what are parameters of --forcerun and what things are targets.
ERROR
WorkflowError: Target rules may not contain wildcards.
4. Editing the metadata changes the parameters of the ps_mass rule, which has the same effect as forcing ps_mass to re-run, so again there will be two jobs to be run from the DAG.
With older versions of Snakemake this would not be auto-detected, and in fact you can see this behaviour if you remove the hidden .snakemake directory. Then Snakemake has no memory of the rule change, so it will not re-run any jobs unless explicitly told to.
Removing files to trigger reprocessing
In general, getting Snakemake to re-run things by removing files is a bad idea, because it's easy to forget about intermediate files that actually contain stale results and need to be updated. Using the `--forceall` flag is simpler and more reliable. If in doubt, and if it will not be too time consuming, keep it simple and just use `--forceall` to run the whole workflow from scratch.
For the opposite case, where you want to avoid re-running particular steps, see the `--touch` option of Snakemake mentioned later in the lesson.
[fig-dag2]: fig/dag_2.png {alt='TODO: DAG for the partial workflow, with dotted boxes for the jobs that will not be re-run and a solid box for the job that will.'}
Key Points
- A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
- Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
- If output files already exist, Snakemake can skip parts of the DAG
- Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating
Content from Optimising workflow performance
Last updated on 2025-05-20 | Edit this page
Estimated time: 40 minutes
Overview
Questions
- What compute resources are available on my system?
- How do I define jobs with more than one thread?
- How do I measure the compute resources being used by a workflow?
- How do I run my workflow steps in parallel?
Objectives
- Understand CPU, RAM and I/O bottlenecks
- Understand the `threads` declaration
- Use common Unix tools to look at resource usage
Processes, threads and processors
Some definitions:
- Process: A running program (in our case, each Snakemake job can be considered one process)
- Threads: Each process has one or more threads which run in parallel
- Processor: Your computer has multiple CPU cores or processors, each of which can run one thread at a time
These definitions are a little simplified, but fine for our needs. The operating system kernel shares out threads among processors:
- Having fewer threads than processors means you are not fully using all your CPU cores
- Having more threads than processors means threads have to “timeslice” on a core which is generally suboptimal
If you tell Snakemake how many threads each rule will use, and how many cores you have available, it will start jobs in parallel to use all your cores. In the diagram below, five jobs are ready to run and there are four system cores.
Listing the resources of your machine
So, to know how many threads to make available to Snakemake, we need to know how many CPU cores our machine has. On Linux, we can find this out with the `lscpu` command.
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A72
Model: 3
Thread(s) per core: 1
There we can see that we have four CPU cores, each of which can run a single thread.
On macOS, meanwhile, we use the command `sysctl hw.ncpu`:
$ sysctl hw.ncpu
hw.ncpu: 8
In this case, we see that this Mac has eight cores.
Likewise find out the amount of RAM available:
BASH
$ free -h
total used free shared buff/cache available
Mem: 3.7Gi 1.1Gi 110Mi 97Mi 2.6Gi 2.6Gi
Swap: 199Mi 199Mi 60Ki
In this case, the machine has 3.7GiB of total RAM.
On macOS, the command is `sysctl hw.memsize`:
$ sysctl hw.memsize
hw.memsize: 34359738368
This machine has 34,359,738,368 bytes of RAM in total. Dividing by the number of bytes in 1GiB (\(1024^3\) bytes), that becomes 32GiB RAM.
We don’t want to use all of this RAM, but if we don’t mind other applications being unresponsive while our workflow runs, we can use the majority of it.
Finally, to check the available disk space on the current partition, run `df -h .` (or `df -h` without the `.` to show all partitions). This is the same on both macOS and Linux.
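For example (the output will vary from system to system):

```bash
# Show the free space on the partition holding the current directory
df -h .
```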
Parallel jobs in Snakemake
You may want to see the relevant part of the Snakemake documentation.
We'll force all the intermediary steps to re-run by using the `--forceall` flag to Snakemake, and time the whole run using the `time` command.
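One way to do this is sketched below, assuming (as earlier in the lesson) that the final target is the spectrum plot and that conda environments are in use:

```bash
# Run the whole workflow from scratch, one job at a time, and measure the wall-clock time
time snakemake --jobs 1 --forceall --use-conda assets/plots/spectrum.pdf
```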
Measuring how concurrency affects execution time
What is the wallclock time reported by the above command? We’ll work out the average for everyone present, or if you are working through the material on your own, repeat the measurement three times to get your own average.
Now change the Snakemake concurrency option to `--jobs 2` and then `--jobs 4`.
- How does the total execution time change?
- What factors do you think limit the power of this setting to reduce the execution time?
The time will vary depending on the system configuration, but somewhere around TODO seconds is expected, and this should reduce to around TODO seconds with `--jobs 2`; higher `--jobs` values will produce diminishing returns.
Things that may limit the effectiveness of parallel execution include:
- The number of processors in the machine
- The number of jobs in the DAG which are independent and can therefore be run in parallel
- The existence of single long-running jobs
- The amount of RAM in the machine
- The speed at which data can be read from and written to disk
There are a few gotchas to bear in mind when using parallel execution:
- Parallel jobs will use more RAM. If you run out then either your OS will swap data to disk, or a process will crash.
- Parallel jobs may trip over each other if they try to write to the same filename at the same time (this can happen with temporary files).
- The on-screen output from parallel jobs will be jumbled, so save any output to log files instead.
Multi-thread rules in Snakemake
In the diagram at the top, we showed jobs with 2 and 8 threads. These are defined by adding a `threads:` declaration to the rule definition. We could do this for the `ps_mass` rule:
rule ps_mass:
    input:
        data="raw_data/beta{beta}/out_corr",
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot="intermediary_data/beta{beta}/corr.ps_eff_mass.{plot_filetype}",
    log:
        messages="intermediary_data/beta{beta}/corr.ps_mass.log",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    threads: 4
    shell:
        "python -m su2pg_analysis.meson_mass {input.data} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"
You should explicitly use `threads: 4` rather than `params: threads = "4"`, because Snakemake considers the number of threads when scheduling jobs. Also, if the number of threads requested for a rule is more than the number of processors made available to Snakemake (via `--cores` or `--jobs`), the job will be run with the lower number.
Snakemake uses the `threads` value to set common environment variables like `OMP_NUM_THREADS`. If you need to pass the number explicitly to your program, you can use the `{threads}` placeholder to get it.
Fine-grained profiling
Rather than timing the entire workflow, we can ask Snakemake to benchmark an individual rule. For example, to benchmark the `ps_mass` step we could add this to the rule definition:
rule ps_mass:
benchmark:
"benchmarks/ps_mass.beta{beta}.txt"
...
The dataset here is so small that the numbers are tiny, but for real data this can be very useful as it shows time, memory usage and IO load for all jobs.
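The benchmark file itself is a small tab-separated table; the exact columns depend on your Snakemake version, but they include the wall-clock time and peak memory use of the job. One way to view it, assuming the `beta2.0` job has been run:

```bash
# Show the benchmark table for the beta2.0 job with aligned columns
column -t benchmarks/ps_mass.beta2.0.txt
```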
Running jobs on a cluster
Learning about clusters is beyond the scope of this course, but can be essential for more complex workflows working with large amounts of data.
When working with Snakemake, there are two options for getting the workflow running on a cluster:

1. Similarly to most tools, we can install Snakemake on the cluster, write a job script, and execute Snakemake on our workflow inside a job.
2. We can teach Snakemake how to run jobs on the cluster, and run our workflow from our own computer, having Snakemake do the work of submitting and monitoring the jobs for us.
To run Snakemake in the second way, someone will need to determine the right parameters for your particular cluster and save them as a profile. Once this is working, you can share the profile with other users on the cluster, so discuss this with your cluster sysadmin.
Instructions for configuring the Slurm executor plugin can be found in the Snakemake plugin catalog, along with the `drmaa`, `cluster-generic`, and `cluster-sync` plugins, which can support PBS, SGE, and other cluster schedulers.
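As a sketch of the second approach: once a working profile directory exists (the name `profiles/cluster` below is purely illustrative), the workflow can be launched with a command like the following, and Snakemake will submit and monitor the cluster jobs for you.

```bash
# Hypothetical invocation: assumes the Slurm executor plugin is installed and that
# profiles/cluster/config.yaml defines the executor and default resources for your site
snakemake --profile profiles/cluster --jobs 100 assets/plots/spectrum.pdf
```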
![][fig-cluster]
Running workflows on HPC or Cloud systems could be a whole course in itself. The topic is too important not to be mentioned here, but also complex to teach because you need a cluster to work on.
If you are teaching this lesson and have institutional HPC, then ideally you should liaise with the administrators of the system to make a suitable installation of a recent Snakemake version and a profile to run jobs on the cluster job scheduler. In practice this may be easier said than done!
If you are able to demonstrate Snakemake running on cloud as part of a workshop then we’d much appreciate any feedback on how you did this and how it went.
Cluster demo
At this point in the course there may be a cluster demo…
[fig-cluster]: fig/cluster.jpg {alt='A photo of some high performance computing hardware racked in five cabinets in a server room. Each cabinet is about 2.2 metres high and 0.8 m wide. The doors of the cabinets are open to show the systems inside. Orange and yellow cabling is prominent, connecting ports within the second and third racks.'}
Key Points
- To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
- You also need to consider RAM, disk, and network bottlenecks
- Profile your jobs to see what is taking most resources
- Snakemake is great for running workflows on compute clusters
Comments in Snakefiles
In the above code, the line beginning with `#` is a comment line. Hopefully you are already in the habit of adding comments to your own software. Good comments make any code more readable, and this is just as true with Snakefiles.