Content from Running commands with Snakemake


Last updated on 2025-07-28

Estimated time: 15 minutes

Overview

Questions

  • How do I run a simple command with Snakemake?

Objectives

  • Create a Snakemake recipe (a Snakefile)
  • Use Snakemake to count the trajectories generated for an ensemble

Introduction


Data analysis in lattice quantum field theory generally has many moving parts: you will likely have many ensembles, with differing physical and algorithmic parameters, and for each many different observables may be computed. These need to be combined in different ways, making sure that compatible statistics are used. Making sure that each step runs in the correct order is non-trivial, requiring careful bookkeeping, especially if we want to update data as ensembles are extended, if we want to take advantage of parallelism to get results faster, and if we want auditability to be able to verify later what steps were performed.

While we could build up tools to do all of these things from scratch, these are not challenges exclusive to lattice, and so we can take advantage of others’ work rather than reinventing the wheel. This frees up our time to focus on the physics challenges. The category of “tools to help run complex arrangements of tools in the right order” is called “workflow management”; there are workflow managers available, most of which are specialised to a specific class of applications.

One workflow manager developed for scientific data analysis is called Snakemake; this will be the target of this lesson. Snakemake is similar to GNU Make, in that you create a text file containing rules specifying how input files are translated to output files, and then the software will work out what rules to run to generate a specified output from the available input files. Unlike Make, Snakemake uses a syntax closely based on Python, and files containing rules can be extended using standard Python syntax. It also has many quality-of-life improvements compared to Make, and so is much better suited for writing data analysis workflows.

At this point, you should have Snakemake already installed and available to you. To test this, we can open a terminal and run

$ snakemake --version
8.25.3

If you instead get a “command not found” error, go back to the setup and check that you have completed all the necessary steps.

Looking at the sample data


You should already have the sample data files unpacked. (If not, refer back to the lesson setup.) Under the su2pg/raw_data directory, you will find a series of subdirectories, each containing data for a single ensemble. In each are files containing the log of the configuration generation, the computation of the quenched meson spectrum, and the computation of the Wilson flow.

The sample data are for the SU(2) pure Yang-Mills theory, and have been generated using the HiRep code.

Each log contains header lines describing the setup, information on the computation being performed, and results for observables computed on each configuration. Code to parse these logs and compute statistics is included with the sample data; we’ll use it in due course.

Making a Snakefile


To start with, let’s define a rule to count the number of lines in one of the raw data files.

Within the su2pg/workflow directory, edit a new text file named Snakefile. Into it, insert the following content:

rule count_lines:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "wc -l raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"

Key points about this file

  1. The file is named Snakefile - with a capital S and no file extension.
  2. Some lines are indented. Indents must be with space characters, not tabs.
  3. The rule definition starts with the keyword rule followed by the rule name, then a colon.
  4. We named the rule count_lines. You may use letters, numbers or underscores, but the rule name must begin with a letter and may not be a keyword.
  5. The keywords input, output, shell are all followed by a colon.
  6. The file names and the shell command are all in "quotes".

The first line tells Snakemake we are defining a new rule. Subsequent indented lines form a part of this rule; while there are none here, any subsequent unindented lines would not be included in the rule. The input: line tells Snakemake what files to look for to be able to run this rule. If this file is missing (and there is no rule to create it), Snakemake will not consider running this rule. The output: line tells Snakemake what files to expect the rule to generate. If this file is not generated, then Snakemake will abort the workflow with an error. Finally, the shell: block tells Snakemake what shell commands to run to get the specified output from the given input.

Going back to the shell now, we can test this rule. From the su2pg directory, we can run the command

snakemake --jobs 1 --forceall --printshellcmds intermediary_data/beta2.0/pg.count

If we’ve made any transcription errors in the rule (missing quotes, bad indentations, etc.), then it will become clear at this point, as we’ll receive an error that we will need to fix.

For now, we will consistently run snakemake with the --jobs 1 --forceall --printshellcmds options. As we move through the lesson, we’ll explain in more detail when we need to modify them.

Let’s check that the output was correctly generated:

$ cat intermediary_data/beta2.0/pg.count
  31064 raw_data/beta2.0/out_pg
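
The rule structure is the same for any command that reads one file and writes another. For example, a rule to count the lines in the Wilson flow log for the same ensemble could look like the sketch below (this assumes the flow log is named out_wflow, as it is in the sample data; we won’t need this rule, so there is no need to add it to your Snakefile):

rule count_wflow_lines:
    input: "raw_data/beta2.0/out_wflow"
    output: "intermediary_data/beta2.0/wflow.count"
    shell:
        "wc -l raw_data/beta2.0/out_wflow > intermediary_data/beta2.0/wflow.count"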

Callout

You might have noticed that we are grouping files into directories like raw_data and intermediary_data. It is generally a good idea to keep raw input data separate from data generated by the analysis. This means that if you need to run a clean analysis starting from your input data, then it is much easier to know what to remove and what to keep. Ideally, the raw_data directory should be kept read-only, so that you don’t accidentally modify your input data. Similarly, it is a good idea to separate out “files that you want to include in a paper” from “intermediary files generated by the workflow but not needed in the paper”; we’ll talk more about that in a later section.

You might also worry that your tooling will need to use mkdir to create these directories; in fact, Snakemake will automatically create all directories where it expects to see output from rules that it runs.

In the first few episodes we always run Snakemake with the --forceall flag; we won’t explain exactly what this flag does until episode 4. The rationale is that the default Snakemake behaviour when pruning the DAG leads to learners seeing different output (typically the message “nothing to be done”) when repeating the exact same command. This can seem strange to learners who are used to scripting and imperative programming.

The internal rules used by Snakemake to determine which jobs in the DAG are to be run, and which skipped, are pretty complex, but the behaviour seen under --forceall is much simpler and more consistent; Snakemake simply runs every job in the DAG every time. You can think of --forceall as disabling the lazy evaluation feature of Snakemake, until we are ready to properly introduce and understand it.

Running Snakemake

Run snakemake --help | less to see the help for all available options. What does the --printshellcmds option in the snakemake command above do?

  1. Protects existing output files
  2. Prints the shell commands that are being run to the terminal
  3. Tells Snakemake to only run one process at a time
  4. Prompts the user for the correct input file

Hint: you can search in the text by pressing /, and quit back to the shell with q

  2. Prints the shell commands that are being run to the terminal

This is such a useful thing we don’t know why it isn’t the default! The --jobs 1 option is what tells Snakemake to only run one process at a time, and we’ll stick with this for now as it makes things simpler. The --forceall option tells Snakemake to always recreate output files, and we’ll learn about protected outputs much later in the course. Answer 4 is a total red herring, as Snakemake never prompts interactively for user input.

Counting trajectories


The count of output lines isn’t particularly useful. Potentially more interesting is the number of trajectories in a given file. In a HiRep generation log, each trajectory concludes with a line of the form

OUTPUT

[MAIN][0]Trajectory #1: generated in [39.717707 sec]

We can use grep to count these, as

BASH

grep -c generated raw_data/beta2.0/out_pg

Counting sequences in Snakemake

Modify the Snakefile to count the number of trajectories in raw_data/beta2.0/out_pg, rather than the number of lines.

  • Rename the rule to count_trajectories
  • Keep the output file name the same
  • Remember that the result needs to go into the output file, not just be printed on the screen
  • Test the new rule once it is done.
rule count_trajectories:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "grep -c generated raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"

Key Points

  • Before running Snakemake you need to write a Snakefile
  • A Snakefile is a text file which defines a list of rules
  • Rules have inputs, outputs, and shell commands to be run
  • You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule

Content from Running Python code with Snakemake


Last updated on 2025-07-28

Estimated time: 15 minutes

Overview

Questions

  • How do I configure an environment to run Python with Snakemake?

Objectives

  • Create a Conda environment definition
  • Use Snakemake to instantiate this environment and use it to run Python code

Why define an environment


Snakemake is written in Python, so anyone running a Snakemake workflow already has Python available. In principle, we could make use of this installation to run any Python code we need to run in our workflow. However, it’s more than likely we will need to make use of libraries that are not installed as part of the Snakemake installation. At that point, we would need either to install additional libraries into our Snakemake environment (risking conflicts with other analyses that share that environment but need different sets of libraries), or to create a second Snakemake environment for this analysis. If different steps of our workflow need different, conflicting sets of libraries then this becomes more complicated again.

We would also like those trying to reproduce our work to be able to run using exactly the same software environment that we used in our original work. In principle, we could write detailed documentation specifying which packages to install; however, it is both more precise and more convenient to define the environment as a data file, which Conda can use to build the same environment every time.

Even better, we can tell Snakemake to use a specific Conda environment for each rule we define.

A basic environment definition


Conda environment definitions are created in YAML-format files. These specify what Conda packages are needed (including the target version of Python), as well as any Pip packages that are installed.

Callout

Some packages give you a choice as to whether to install using Conda or Pip. When working interactively with an environment, using Pip consistently typically reduces the chance of getting into a state where Conda is unable to install packages. That is less of a problem when constructing new environments from definition files, but even so, using Pip where possible will typically allow environments to resolve and install more quickly.

By convention, Conda environment definitions in Snakemake workflows are placed in a workflow/envs/ directory. Let’s create workflow/envs/analysis.yml now, and place the following content into it

YAML

name: su2pg_analysis
channels:
  - conda-forge
dependencies:
  - pip=24.2
  - python=3.12.6
  - pip:
      - h5py==3.11.0
      - matplotlib==3.9.2
      - numpy==2.1.1
      - pandas==2.2.3
      - scipy==1.14.1
      - uncertainties==3.2.2
      - corrfitter==8.2
      - -e ../../libs/su2pg_analysis

This will install the specified versions of h5py, Matplotlib, Numpy, Pandas, Scipy, uncertainties, and corrfitter, as well as the analysis tools provided in the libs directory. The latter are installed in editable mode, so if you need to modify them to fix bugs or add functionality while working on your workflow, you don’t need to remember to manually reinstall them.

Using an environment definition in a Snakefile


Now that we have created an environment file, we can use it in our Snakefile to compute the average plaquette from a configuration generation log. Let’s add the following rule to workflow/Snakefile:

rule avg_plaquette:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.plaquette.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette raw_data/beta2.0/out_pg --output_file intermediary_data/beta2.0/pg.plaquette.json.gz"

The conda: block tells Snakemake where to find the Conda environment that should be used for running this rule. Let’s test this now:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/pg.plaquette.json.gz

We need to specify --use-conda to tell Snakemake to pay attention to the conda: specification.
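
If you would like Snakemake to create the Conda environment ahead of time, without running any jobs (useful because building the environment can be slow), you can use the --conda-create-envs-only flag, for example:

snakemake --jobs 1 --use-conda --conda-create-envs-only intermediary_data/beta2.0/pg.plaquette.json.gz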

Let’s check now that the output was correctly generated:

$ cat intermediary_data/beta2.0/pg.plaquette.json.gz | gunzip | head -n 20
{
 "program": "pyerrors 2.13.0",
 "version": "1.1",
 "who": "ed",
 "date": "2025-05-20 15:44:09 +0100",
 "host": "tsukasa.lan, macOS-15.4-arm64-arm-64bit",
 "description": {
  "group_family": "SU",
  "num_colors": 2,
  "nt": 48,
  "nx": 24,
  "ny": 24,
  "nz": 24,
  "beta": 2.0,
  "num_heatbath": 1,
  "num_overrelaxed": 4,
  "num_thermalization": 1000,
  "thermalization_time": 2453.811479,
  "num_trajectories": 10010
 },

Some of the output will differ on your machine, since this library tracks provenance, such as where and when the code was run, in the output file.

More plaquettes

Add a second rule to compute the average plaquette from the file raw_data/beta2.2/out_pg. Add this to the same Snakefile you already made, under the avg_plaquette rule, and run your rules in the terminal. When running the snakemake command you’ll need to tell Snakemake to make both the output files.

You can choose whatever name you like for this second rule, but it can’t be avg_plaquette, as rule names need to be unique within a Snakefile. So in this example answer we use avg_plaquette2.

rule avg_plaquette2:
    input: "raw_data/beta2.2/out_pg"
    output: "intermediary_data/beta2.2/pg.plaquette.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette raw_data/beta2.2/out_pg --output_file intermediary_data/beta2.2/pg.plaquette.json.gz"

Then in the shell:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/pg.plaquette.json.gz intermediary_data/beta2.2/pg.plaquette.json.gz

If you think writing a separate rule for each output file is silly, you are correct. We’ll address this next!

Content from Placeholders and wildcards


Last updated on 2025-07-28

Estimated time: 15 minutes

Overview

Questions

  • How do I make a generic rule?
  • How does Snakemake decide what rule to run?

Objectives

  • Use Snakemake to compute the plaquette in any file
  • Understand the basic steps Snakemake goes through when running a workflow
  • See how Snakemake deals with some errors

Making rules more generic


In the previous two episodes, we wrote rules to count the number of generated trajectories in, and compute the average plaquette of, one ensemble. As a reminder, this was one such rule:

rule count_trajectories:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "grep -c generated raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"

When we needed to do the same for a second ensemble, we made a second copy of the rule, and changed the input and output filenames. This is obviously not scalable to large analyses: instead, we would like to write one rule for each type of operation we are interested in. To do this, we’ll need to use placeholders and wildcards. Such a generic rule might look as follows:

# Count number of generated trajectories for any ensemble
rule count_trajectories:
    input: "raw_data/{subdir}/out_pg"
    output: "intermediary_data/{subdir}/pg.count"
    shell:
        "grep -c generated {input} > {output}"

Comments in Snakefiles

In the above code, the line beginning # is a comment line. Hopefully you are already in the habit of adding comments to your own software. Good comments make any code more readable, and this is just as true with Snakefiles.

{subdir} here is an example of a wildcard. Wildcards are used in the input and output lines of the rule to represent parts of filenames. Much like the * pattern in the shell, the wildcard can stand in for any text in order to make up the desired filename. As with naming your rules, you may choose any name you like for your wildcards; so here we used subdir, since it is describing a subdirectory. If subdir is set to beta2.0 then the new generic rule will have the same inputs and outputs as the original rule. Using the same wildcards in the input and output is what tells Snakemake how to match input files to output files.

If two rules use a wildcard with the same name then Snakemake will treat them as different entities—rules in Snakemake are self-contained in this way.

Meanwhile, {input} and {output} are placeholders. Placeholders are used in the shell section of a rule. Snakemake will replace them with appropriate values before running the command: {input} with the full name of the input file, and {output} with the full name of the output file.

If we had wanted to include the value of the subdir wildcard directly in the shell command, we could have used the placeholder {wildcards.subdir}, but in most cases, as here, we just need the {input} and {output} placeholders.
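
For example, if we wanted the shell command to also announce which ensemble it is working on, a sketch using the wildcard directly (the echo is added purely for illustration) could be:

rule count_trajectories:
    input: "raw_data/{subdir}/out_pg"
    output: "intermediary_data/{subdir}/pg.count"
    shell:
        "echo 'Counting trajectories for {wildcards.subdir}' && grep -c generated {input} > {output}"

We don’t need the extra message, so keep the simpler version above in your Snakefile.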

Let’s test this general rule now:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/pg.count

As previously, if you see errors at this point, there is likely a problem with your Snakefile; check that all the rules match the ones that have appeared here, and that there aren’t multiple rules with the same name.

General plaquette computation

Modify your Snakefile so that it can compute the average plaquette for any ensemble, not just the ones we wrote specific rules for in the previous episode.

Test this with some of the values of \(\beta\) present in the raw data.

The replacement rule should look like:

# Compute average plaquette for any ensemble from its generation log
rule avg_plaquette:
    input: "raw_data/{subdir}/out_pg"
    output: "intermediary_data/{subdir}/pg.plaquette.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette {input} --output_file {output}"

To test this, for example:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta1.8/pg.plaquette.json.gz

Choosing the right wildcards

Our rule puts the trajectory counts into output files named like pg.count. How would you have to change the count_trajectories rule definition if you wanted:

  1. the output file for raw_data/beta1.8/out_hmc to be intermediary_data/beta1.8/hmc.count?

  2. the output file for raw_data/beta1.8/mass_fun-0.63/out_hmc to be intermediary_data/beta1.8/mass_fun-0.63/hmc.count?

  3. the output file for raw_data/beta1.8/mass_fun-0.63/out_hmc to be intermediary_data/hmc_b1.8_m-0.63.count (for raw_data/beta1.9/mass_fun-0.68/out_hmc to be intermediary_data/hmc_b1.9_m-0.68.count, etc.)?

  4. the output file for raw_data/beta1.8/mass_fun-0.63/out_hmc to be intermediary_data/hmc_m-0.63.count (for raw_data/beta1.9/mass_fun-0.68/out_hmc to be intermediary_data/hmc_m-0.68.count, etc.)?

(Assume that both pure-gauge and HMC logs tag generated trajectories the same way. Note that input files for the latter data are not included in the sample data, so these will not work as-is.)

In none of these cases do you need to change the shell part of the rule at all.

input: "raw_data/{subdir}/out_hmc"
output: "intermediary_data/{subdir}/hmc.count"

This can be done by changing only the static parts of the input: and output: lines.

This in fact requires no change from the previous answer. The wildcard {subdir} can include /, so can represent multiple levels of subdirectory.

input: "raw_data/beta{beta}/mass_fun{mass}/out_hmc"
output: "intermediary_data/hmc_b{beta}_m{mass}.count"

In this case, it was necessary to change the wildcards, because the subdirectory name needs to be split to obtain the values of \(\beta\) and \(m_{\mathrm{fun}}\). The names chosen here are {beta} and {mass}, but you could choose any names, as long as they match between the input and output parts.

This one isn’t possible, because Snakemake cannot determine which input file you want to count by matching wildcards on the file name intermediary_data/hmc_m-0.63.count. You could try a rule like this:

input: "raw_data/beta1.8/mass_fun{mass}/out_hmc"
output: "intermediary_data/hmc_m{mass}.count"

…but it only works because \(\beta\) is hard-coded into the input line, and the rule will only work on this specific sample, not other cases where other values of \(\beta\) may be wanted. In general, input and output filenames need to be carefully chosen so that Snakemake can match everything up and determine the right input from the output filename.

Filenames aren’t data

Notice that in some examples we can pull out the value of \(\beta\) from the name of the directory in which the file is located. However, ideally, we should avoid relying on this being correct. The name and location are useful for us to find the correct file, but we should try to ensure that the file contents also contain these data, and that we make use of those data in preference to the filename.

Snakemake order of operations


We’re only just getting started with some simple rules, but it’s worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases:

  1. Prepares to run:
    1. Reads in all the rule definitions from the Snakefile
  2. Plans what to do:
    1. Sees what file(s) you are asking it to make
    2. Looks for a matching rule by looking at the outputs of all the rules it knows
    3. Fills in the wildcards to work out the input for this rule
    4. Checks that this input file is actually available
  3. Runs the steps:
    1. Creates the directory for the output file, if needed
    2. Removes the old output file if it is already there
    3. Only then, runs the shell command with the placeholders replaced
    4. Checks that the command ran without errors and made the new output file as expected

For example, if we now ask Snakemake to generate a file named intermediary_data/wibble_1/pg.count:

OUTPUT

$ snakemake --jobs 1 --forceall --printshellcmds intermediary_data/wibble_1/pg.count
Building DAG of jobs...
MissingInputException in line 1 of /home/zenmaster/data/su2pg/workflow/Snakefile:
Missing input files for rule count_trajectories:
    output: intermediary_data/wibble_1/pg.count
    wildcards: subdir=wibble_1
    affected files:
        raw_data/wibble_1/out_pg

Snakemake sees that a file with a name like this could be produced by the count_trajectories rule. However, when it performs the wildcard substitution it sees that the input file would need to be named raw_data/wibble_1/out_pg, and there is no such file available. Therefore Snakemake stops and gives an error before any shell commands are run.

Dry-run (--dry-run) mode

It’s often useful to run just the first two phases, so that Snakemake will plan out the jobs to run, and print them to the screen, but never actually run them. This is done with the --dry-run flag, eg:

BASH

$ snakemake --dry-run --forceall --printshellcmds intermediary_data/beta2.0/pg.count

We’ll make use of this later in the lesson.

The amount of checking may seem pedantic right now, but as the workflow gains more steps this will become very useful to us indeed.

Key Points

  • Snakemake rules are made generic with placeholders and wildcards
  • Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
  • Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
  • Snakemake checks for various error conditions and will stop if it sees a problem

Content from Chaining rules


Last updated on 2025-07-28

Estimated time: 50 minutes

Overview

Questions

  • How do I combine rules into a workflow?
  • How can I make a rule with multiple input files?
  • How should I refer to multiple files with similar names?

Objectives

  • Use Snakemake to combine the outputs of several rules into a single plot
  • Understand how rules are linked by filename patterns
  • Be able to use multiple input files in one rule

A pipeline of multiple rules


We have so far been able to count the number of generated trajectories, and compute the average plaquette, given an output log from the configuration generation. However, an individual average plaquette is not interesting in isolation; what is more interesting is how it varies between different values of the input parameters. To do this, we will need to take the output of the avg_plaquette rule that we defined earlier, and use it as input for another rule.

Let’s define that rule now:

# Take individual data files for average plaquette and plot combined results
rule plot_avg_plaquette:
    input:
        "intermediary_data/beta1.8/pg.plaquette.json.gz",
        "intermediary_data/beta2.0/pg.plaquette.json.gz",
        "intermediary_data/beta2.2/pg.plaquette.json.gz",
    output:
        "assets/plots/plaquette_scan.pdf"
    conda: "envs/analysis.yml"
    shell:
        "python src/plot_plaquette.py {input} --output_filename {output}"

Callout

You can see that here we’re putting “files that want to be included in a paper” in an assets directory, similarly to the raw_data and intermediary_data directories we discussed in a previous episode. It can be useful to further distinguish plots, tables, and other definitions, by using subdirectories in this directory.

Rather than one input, as we have seen in rules so far, this rule requires three. When Snakemake substitutes these into the {input} placeholder, it will automatically add a space between them. Let’s test this now:

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf
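
Because we used --printshellcmds, we can see the command after placeholder substitution; the three input files appear as a single space-separated list, something like:

python src/plot_plaquette.py intermediary_data/beta1.8/pg.plaquette.json.gz intermediary_data/beta2.0/pg.plaquette.json.gz intermediary_data/beta2.2/pg.plaquette.json.gz --output_filename assets/plots/plaquette_scan.pdf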

Look at the logging messages that Snakemake prints in the terminal. What has happened here?

  1. Snakemake looks for a rule to make assets/plots/plaquette_scan.pdf
  2. It determines that the plot_avg_plaquette rule can do this, if it has intermediary_data/beta1.8/pg.plaquette.json.gz, intermediary_data/beta2.0/pg.plaquette.json.gz, and intermediary_data/beta2.2/pg.plaquette.json.gz.
  3. Snakemake looks for a rule to make intermediary_data/beta1.8/pg.plaquette.json.gz
  4. It determines that avg_plaquette can make this if subdir=beta1.8
  5. It sees that the input needed is therefore raw_data/beta1.8/out_pg
  6. Now Snakemake has reached an available input file, it runs the avg_plaquette rule.
  7. It then looks through the other two \(\beta\) values in turn, repeating the process until it has all of the needed inputs.
  8. Finally, it runs the plot_avg_plaquette rule.

Here’s a visual representation of this process:

A visual representation of the above process showing the rule definitions, with arrows added to indicate the order wildcards and placeholders are substituted. Blue arrows start from the input of the `plot_avg_plaquette` rule, which are the files `intermediary_data/beta1.8/pg.plaquette.json.gz`, `intermediary_data/beta2.0/pg.plaquette.json.gz`, and `intermediary_data/beta2.2/pg.plaquette.json.gz`, then point down from components of the filename to wildcards in the output of the `avg_plaquette` rule. Orange arrows then track back up through the shell parts of both rules, where the placeholders are, and finally back to the target output filename at the top.

This, in a nutshell, is how we build workflows in Snakemake.

  1. Define rules for all the processing steps
  2. Choose input and output naming patterns that allow Snakemake to link the rules
  3. Tell Snakemake to generate the final output files

If you are used to writing regular scripts this takes a little getting used to. Rather than listing steps in order of execution, you are always working backwards from the final desired result. The order of operations is determined by applying the pattern matching rules to the filenames, not by the order of the rules in the Snakefile.

Choosing file name patterns

Chaining rules in Snakemake is a matter of choosing filename patterns that connect the rules. There’s something of an art to it, and most times there are several options that will work, but in all cases the file names you choose will need to be consistent and unambiguous.

Making file lists easier


In the rule above, we plotted the average plaquette for three values of \(\beta\) by listing the files expected to contain their values. In fact, we have data for a larger number of \(\beta\) values, but typing out each file by hand would be quite cumbersome. We can make use of the expand() function to do this more neatly:

    input:
        expand(
            "intermediary_data/beta{beta}/pg.plaquette.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        )

The first argument to expand() here is a template for the filename, and subsequent keyword arguments are lists of variables to fill into the placeholders. The output is the cartesian product of all the parameter lists.
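
To see the Cartesian product in action, here is a small sketch with two lists (the second list of channel names is hypothetical, purely to illustrate how the combinations are generated):

expand(
    "intermediary_data/beta{beta}/corr.{channel}_mass.json.gz",
    beta=[2.0, 2.2],
    channel=["ps", "v"],
)
# gives a list of these four filenames:
#   intermediary_data/beta2.0/corr.ps_mass.json.gz
#   intermediary_data/beta2.0/corr.v_mass.json.gz
#   intermediary_data/beta2.2/corr.ps_mass.json.gz
#   intermediary_data/beta2.2/corr.v_mass.json.gz

For our plaquette plot, a single beta list is all we need.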

We can check that this works correctly:

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf

Tabulating trajectory counts

The script src/tabulate_counts.py will take a list of files containing plaquette data, and output a LaTeX table of trajectory counts. Write a rule to generate this table for all values of \(\beta\), and output it to assets/tables/counts.tex.

The replacement rule should look like:

# Output a LaTeX table of trajectory counts
rule tabulate_counts:
    input:
        expand(
            "intermediary_data/beta{beta}/pg.count",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        )
    output: "assets/tables/counts.tex"
    conda: "envs/analysis.yml"
    shell:
        "python src/tabulate_counts.py {input} > {output}"

To test this, for example:

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/tables/counts.tex

Tabulating trajectory counts (continued)

This setup currently requires reading the value of \(\beta\) from the filename. Why is this not ideal? How would the workflow need to be changed to avoid this?

It’s easy for files to be misnamed when creating or copying them. Putting the wrong data into the file is harder, especially when it’s a raw data file generated by the same program as the rest of the data. (If the wrong value were given as input, this could happen, but the corresponding output data would also be generated at that incorrect value. Provided the values are treated consistently, the downstream analysis could in fact still be valid, just not exactly as intended.)

Currently, grep -c is used to count the number of trajectories. This would need to be replaced or supplemented with a tool that reads the value of \(\beta\) from the input log and outputs it along with the trajectory count. The src/tabulate_counts.py script could then be updated to use this number, rather than the filename.

In fact, the plaquette module does just this; in addition to the average plaquette, it also records the number of trajectories generated as part of the metadata and provenance information it tracks.
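
You can confirm this by looking for the relevant key in one of the files the workflow has already produced, for example:

$ gunzip -c intermediary_data/beta2.0/pg.plaquette.json.gz | grep num_trajectories
  "num_trajectories": 10010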

Key Points

  • Snakemake links up rules by iteratively looking for rules that make missing inputs
  • Careful choice of filenames allows this to work
  • Rules may have multiple named input files (and output files)
  • Use expand() to generate lists of filenames from a template

Content from Metadata and parameters


Last updated on 2025-07-28

Estimated time: 15 minutes

Global parameters


Thus far, each of our rules has taken one or more input files, and given output files solely based on that. However, in some cases we may want to control options without having them within an input file.

For example, in the previous episode, we wrote a rule to plot a graph using the script src/plot_plaquette.py. The style of output we got was good for a paper, but if we were producing a poster, or putting the plot onto a slide with a dark background, we may wish to use a different output style. The plot_plaquette.py script accepts a --plot_styles argument, to tell it what style file to use when plotting. One way to make use of this would be to add --plot_styles styles/paper.mplstyle directly to the shell: block. However, if we had many such rules, and wanted to switch from generating output for a paper to generating it for a poster, then we would need to change the value in many places. Instead, we can define a variable at the top of the Snakefile

plot_styles = "styles/paper.mplstyle"

Then, when we use a script to generate a plot, we can update the shell: block of the corresponding rule similarly to

"python src/plot_plaquette.py {input} --output_filename {output} --plot_styles {plot_styles}"

Snakemake will substitute the value of the global plot_styles variable in place of the {plot_styles} placeholder.
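
Putting these pieces together, the plotting rule would now look something like this (with plot_styles defined at the top of the Snakefile):

# Take individual data files for average plaquette and plot combined results
rule plot_avg_plaquette:
    input:
        expand(
            "intermediary_data/beta{beta}/pg.plaquette.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        )
    output:
        "assets/plots/plaquette_scan.pdf"
    conda: "envs/analysis.yml"
    shell:
        "python src/plot_plaquette.py {input} --output_filename {output} --plot_styles {plot_styles}"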

We can test this by changing paper to poster, and running

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf

We can see that the generated file now uses a different set of fonts.

Wilson flow

The tool su2pg_analysis.w0 computes the scale \(w_0\) given a log of the energy density during evolution of the Wilson flow for an ensemble. To do this, the reference scale \(\mathcal{W}_0\) needs to be passed to the --W0 parameter. Use this, and the logs stored in the files out_wflow in each ensemble’s raw data directory, to output the \(w_0\) scale in a file wflow.w0.json.gz for each ensemble, taking the reference value \(\mathcal{W}_0 = 0.2\).

W0_reference = 0.2

# Compute w0 scale for single ensemble for fixed reference scale
rule w0:
    input: "raw_data/{subdir}/out_wflow"
    output: "intermediary_data/{subdir}/wflow.w0.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.w0 {input} --W0 {W0_reference} --output_file {output}"

Generating different filetypes

In addition to different plot styles, we may also wish to generate different filetypes. PDF is useful for including in LaTeX, but SVG may be a better format to use with some tools.

If we add a global definition:

plot_filetype = "pdf"

and update the output: block of the rule as:

    output:
        "assets/plots/plaquette_scan.{plot_filetype}"

does this have the same effect as the example with --plot_styles above?

(Hint: what happens when you try to make the targets assets/plots/plaquette_scan.svg and assets/plots/plaquette_scan.txt by specifying them at the command line, without changing the value of plot_filetype?)

This can achieve a similar result, but in a slightly different way. In the --plot_styles example, the {plot_styles} string is in the shell: block, and so directly looks up the global plot_styles variable. (Recall that to use a wildcard’s value in the shell: block, we needed to explicitly use the wildcards. prefix.)

However, in this case the {plot_filetype} string is in the output: block, so defines a wildcard. This may take any value, so if we instruct snakemake to produce plaquette_scan.txt, it will diligently pass that filename to plot_plaquette.py.

The plot_filetype = "pdf" is in fact ignored. It could however be used to set a default set of targets to generate, which we will talk about in a later episode.

Metadata from a file


We would frequently like our rules to depend on data that are specific to the ensembles being analysed. For example, consider the rule:

# Compute pseudoscalar mass and amplitude with fixed plateau
rule ps_mass:
    input: "raw_data/{subdir}/out_corr"
    output: "intermediary_data/{subdir}/corr.ps_mass.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output} --plateau_start 18 --plateau_end 23"

This rule hardcodes the positions of the start and end of the plateau region. In most studies, each ensemble and observable may have a different plateau position, so there is no good value to hardcode this to. Instead, we’d like a way of picking the right value from some list of parameters that we specify.

We could do this within the Snakefile, but where possible it is good to avoid mixing data with code. We shouldn’t need to modify our code every time we add or modify the data it is analysing. Instead, we’d like to have a dedicated file containing these parameters, and to be able to have Snakemake read it and pick out the correct values.

To do this, we can exploit the fact that Snakemake is an extension of Python. In particular, we can use the Pandas library for tabular data to read in a CSV file. Let’s add the following to the top of the file:

import pandas

metadata = pandas.read_csv("metadata/ensemble_metadata.csv")

The file being read here is a CSV (Comma Separated Values) file. We can create, view, and modify this with the spreadsheet tool of our choice. Let’s take a look at the file now.

Screenshot of a spreadsheet application showing the file metadata/ensemble_metadata.csv.

You can see that we have columns defining metadata to identify each ensemble, and columns for parameters relating to the analysis of each ensemble.

Now, how do we tell Snakemake to pull out the correct value from this?

# Compute pseudoscalar mass and amplitude, read plateau from metadata
rule ps_mass:
    input: "raw_data/beta{beta}/out_corr"
    output: "intermediary_data/beta{beta}/corr.ps_mass.json.gz"
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end}"

We’ve done a couple of things here. Firstly, we’ve made explicit the reference to \(\beta\) in the file paths, so that we can use beta as a wildcard, similarly to in the challenge in the previous episode. Secondly, we’ve introduced a params: block. This is how we tell Snakemake about quantities that may vary from run to run, but that are not filenames. Thirdly, we’ve used the lookup() function to search the metadata dataframe for the ensemble that we are considering. Finally, we’ve used {params.plateau_start} and {params.plateau_end} placeholders to use these parameters in the shell command that gets run.
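
If you are curious what lookup() is doing, a rough pandas equivalent (shown only for intuition; you don’t need to add this to the Snakefile) for the beta2.0 ensemble would be:

# Select the rows matching the query, with the wildcard value filled in,
# then take the value from the requested column
row = metadata.query("beta == 2.0")
plateau_start = row["ps_plateau_start"].item()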

Let’s test this now:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/corr.ps_mass.json.gz
cat intermediary_data/beta2.0/corr.ps_mass.json.gz | gunzip | head -n 20

OUTPUT

TODO

TODO CHALLENGE

Content from Multiple inputs and outputs


Last updated on 2025-07-28

Estimated time: 15 minutes

Multiple outputs


Quite frequently, we will want a rule to be able to generate more than one file. It’s important we let Snakemake know about this, both so that it can instruct our tools on where to place these files, and so it can verify that they are correctly created by the rule. For example, when fitting a correlation function with a plateau region that we specify, it’s important to look at an effective mass plot to verify that the plateau actually matches what we assert. The rule we just wrote doesn’t do this—it only spits out a numerical answer. Let’s update this rule so that it can also generate the effective mass plot.

# Compute pseudoscalar mass and amplitude, read plateau from metadata,
# and plot effective mass
rule ps_mass:
    input: "raw_data/beta{beta}/out_corr"
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot="intermediary_data/beta{beta}/corr.ps_eff_mass.pdf",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles}"

Rather than having a single string after output:, we now have a block with two lines. Each line has the format name=value, and is followed by a comma. To make use of these variables in our rule, we follow output by a ., and then the name of the variable we want to use, similarly to what we do for wildcards and params.

Non-specificity

What happens if we define multiple named output: variables like this, but refer to the {output} placeholder in the shell: block without specifying a variable name?

(One way to find this out is to try echo {output} as the entire shell: content; this will generate a missing output error, but will first let you see what the output is.)

Snakemake will provide all of the defined output variables, as a space-separated list. This is similar to what happens when an output variable is a list, as we saw earlier when looking at the expand() function.
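
For example, with the ps_mass rule above and the beta2.0 ensemble, echo {output} would print something like:

OUTPUT

intermediary_data/beta2.0/corr.ps_mass.json.gz intermediary_data/beta2.0/corr.ps_eff_mass.pdf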

Flow plots

Update the Wilson flow \(w_0\) computation that we looked at in a previous challenge to also output the flow of \(\mathcal{W}(t)\), so that the shape of the flow may be checked.

rule w0:
    input: "raw_data/{subdir}/out_wflow"
    output:
        data="intermediary_data/{subdir}/wflow.w0.json.gz",
        plot="intermediary_data/{subdir}/wflow.W_flow.{plot_filetype}",
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.w0 {input} --W0 {W0_reference} --output_file {output} --plot_file {output.plot} --plot_styles {plot_styles}"

Multiple inputs


Similarly to outputs, there are many situations where we want to work with more than one class of input file, for example to combine differing observables into one result. The ps_mass rule we wrote previously also outputs the amplitude of the exponential. When combined with the average plaquette via one-loop matching, this can be used to give an estimate of the decay constant. The syntax for multiple inputs is the same as we saw above for output:.

rule one_loop_matching:
    input:
        plaquette="intermediary_data/{subdir}/pg.plaquette.json.gz",
        meson="intermediary_data/{subdir}/corr.{channel}_mass.json.gz",
    output:
        data="intermediary_data/{subdir}/pg.corr.{channel}_decay_const.json.gz",
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.one_loop_matching --plaquette {input.plaquette} --meson {input.meson} --output_file {output.data}"

Naming things

Even when there is only one output: file, we are still allowed to name it. This makes life easier if we need to add more outputs later, and can make it a little clearer what our intent is when we come to read the workflow later.

Spectrum plot

Write a rule that plots the pseudoscalar channel’s decay constant against its mass, for each ensemble studied. The tool src/plot_spectrum.py will help with this.

Try making the filename of the script an input too, so that if the script is changed, Snakemake will correctly re-run the workflow.

Since the plot combines data from every ensemble, the per-ensemble inputs need to be listed explicitly (here with expand()); wildcards in the input cannot be inferred from the single output filename.

rule spectrum:
    input:
        script="src/plot_spectrum.py",
        ps_mass=expand(
            "intermediary_data/beta{beta}/corr.ps_mass.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        ),
        ps_decay_const=expand(
            "intermediary_data/beta{beta}/pg.corr.ps_decay_const.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        ),
    output:
        plot="assets/plots/spectrum.{plot_filetype}"
    conda: "envs/analysis.yml"
    shell:
        "python {input.script} {input.ps_mass} {input.ps_decay_const} --output_file {output.plot} --plot_styles {plot_styles}"

Log files


When a process run by Snakemake exits with an error code, Snakemake removes the output files that the failed job was expected to produce. Usually this is what we want: we don’t want to keep potentially corrupt output that might be used as input for subsequent rules. However, there are some classes of output file that are useful in helping to identify what caused the error in the first place: log files.

We can tell Snakemake that specified files are log files, rather than regular output files, by placing them in a log: block rather than an output: one. Snakemake will not delete a file marked as a log if an error is raised by the process generating it.

For example, for the ps_mass rule above, we might use:

# Compute pseudoscalar mass and amplitude, read plateau from metadata,
# and plot effective mass
rule ps_mass:
    input:
        data="raw_data/beta{beta}/out_corr",
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot="intermediary_data/beta{beta}/corr.ps_eff_mass.{plot_filetype}",
    log:
        messages="intermediary_data/beta{beta}/corr.ps_mass.log",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input.data} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"

|& tee

You may recall that | is the pipe operator in the Unix shell, taking standard output from one program and passing it to standard input of the next. (If this is unfamiliar, you may wish to look through the Software Carpentry introduction to the Unix shell when you have a moment.)

Adding the & symbol means that both the standard output and standard error streams are piped, rather than only standard output. This is useful for a log file, since we will typically want to see errors there.

The tee command “splits a pipe”; that is, it takes standard input and outputs it both to standard output and to the specified filename. This way, we get the log on disk, but also output to screen as well, so we can monitor issues as the workflow runs.

Logged plots

Adjust the solution for plotting the spectrum above so that any warnings or errors generated by the plotting script are logged to a file.

rule spectrum:
    input:
        script="src/plot_spectrum.py",
        ps_mass=expand(
            "intermediary_data/beta{beta}/corr.ps_mass.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        ),
        ps_decay_const=expand(
            "intermediary_data/beta{beta}/pg.corr.ps_decay_const.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        ),
    output:
        plot="assets/plots/spectrum.{plot_filetype}"
    log:
        messages="intermediary_data/plot_spectrum.{plot_filetype}.log"
    conda: "envs/analysis.yml"
    shell:
        "python {input.script} {input.ps_mass} {input.ps_decay_const} --output_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"

Dealing with errors


We’ll end the chapter by looking at a common problem that can arise if you mistype a filename in a rule. It may seem silly to break the workflow when we just got it working, but it will be instructive, so let’s modify the Snakefile and deliberately specify an incorrect output filename in the ps_mass rule.

...
shell:
        "python -m su2pg_analysis.meson_mass {input.data} --output_file {output.data}.json_ --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"

To keep things tidy, this time we’ll manually remove the intermediary data directory.

BASH

$ rm -rvf intermediary_data

And re-run.

$ snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/corr.ps_mass.json.gz

...
TODO

There’s a lot to take in here. Some of the messages are very informative. Some less so.

  1. Snakemake did actually run the tool, as evidenced by the output from the program that we see on the screen.
  2. Python is reporting that there is a file missing.
  3. Snakemake complains that one expected output file is missing: intermediary_data/beta2.0/corr.ps_mass.json.gz.
  4. The other expected output file intermediary_data/beta2.0/corr.ps_eff_mass.pdf was found but has now been removed by Snakemake.
  5. Snakemake suggests this might be due to “filesystem latency”.

This last point is a red herring. “Filesystem latency” is not an issue here, and never will be, since we are not using a network filesystem. We know what the problem is, as we deliberately caused it, but to diagnose an unexpected error like this we would investigate further by looking at the intermediary_data/beta2.0 subdirectory.

$ ls intermediary_data/beta2.0/
TODO

Remember that Snakemake itself does not create any output files. It just runs the commands you put in the shell sections, then checks to see if all the expected output files have appeared.

So if the file names created by your rule are not exactly the same as in the output: block you will get this error, and you will, in this case, find that some output files are present but others (corr.ps_eff_mass.pdf, which was named correctly) have been cleaned up by Snakemake.

Errors are normal

Don’t be disheartened if you see errors like the one above when first testing your new Snakemake workflows. There is a lot that can go wrong when writing a new workflow, and you’ll normally need several iterations to get things just right. One advantage of the Snakemake approach compared to regular scripts is that Snakemake fails fast when there is a problem, rather than ploughing on and potentially running junk calculations on partial or corrupted data. Another advantage is that when a step fails we can safely resume from where we left off, as we’ll see in the next episode.

Finally, edit the names in the Snakefile back to the correct version and re-run to confirm that all is well.

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/spectrum.pdf

Content from How Snakemake plans jobs


Last updated on 2025-07-28

Estimated time: 15 minutes

Overview

Questions

  • How do I visualise a Snakemake workflow?
  • How does Snakemake avoid unnecessary work?
  • How do I control what steps will be run?

Objectives

  • View the DAG for our pipeline
  • Understand the logic Snakemake uses when running and re-running jobs

The DAG


You may have noticed that one of the messages Snakemake always prints is:

OUTPUT

Building DAG of jobs...

A DAG is a Directed Acyclic Graph and it can be pictured like so:

TODO
TODO

The above DAG is based on three of our existing rules, and shows all the jobs Snakemake would run to compute the pseudoscalar decay constant of the \(\beta = 4.0\) ensemble.

Note that:

  • A rule can appear more than once, with different wildcards (a rule plus wildcard values defines a job)
  • A rule may not be used at all, if it is not required for the target outputs
  • The arrows show dependency ordering between jobs
  • Snakemake can run the jobs in any order that doesn’t break dependency. For example, one_loop_matching cannot run until both ps_mass and avg_plaquette have completed, but it may run before or after count_trajectories
  • This is a work list, not a flowchart, so there are no if/else decisions or loops. Snakemake runs every job in the DAG exactly once
  • The DAG depends both on the Snakefile and on the requested target outputs, and the files already present
  • When building the DAG, Snakemake does not look at the shell part of the rules at all. Only when running the DAG will Snakemake check that the shell commands are working and producing the expected output files

How many jobs?

If we asked Snakemake to run one_loop_matching on all twelve ensembles (beta2.0 to beta10.0), how many jobs would that be in total?

36 in total:

  • 12 \(\times\) one_loop_matching
  • 12 \(\times\) ps_mass
  • 12 \(\times\) avg_plaquette
  • 0 \(\times\) count_trajectories

Snakemake is lazy, and laziness is good


For the last few episodes, we’ve told you to run Snakemake like this:

snakemake --jobs 1 --forceall --printshellcmds --use-conda 

As a reminder, the --jobs 1 flag tells Snakemake to run one job at a time, --printshellcmds is to print out the shell commands before running them, and --use-conda to ensure that Snakemake sets up the correct Conda environment.

The --forceall flag tells Snakemake to run every job in the DAG, whether or not the corresponding outputs are already up to date; in normal usage you don’t want this.

At the end of the last chapter, we generated a spectrum plot by running:

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/spectrum.pdf

Now try without the --forceall option. Assuming that the output files are already created, you’ll see this:

$ snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf
TODO

In normal operation, Snakemake only runs a job if:

  1. A target file you explicitly requested to make is missing,
  2. An intermediate file is missing and it is needed in the process of making a target file,
  3. Snakemake can see an input file which is newer than an output file, or
  4. A rule definition or configuration has changed since the output file was created.

The last of these relies on a ledger that Snakemake saves into the .snakemake directory.

Let’s demonstrate each of these in turn, by altering some files and re-running Snakemake without the --forceall option.

BASH

$ rm assets/plots/spectrum.pdf
$ snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf

This just re-runs spectrum, the final step.

BASH

$ rm intermediary_data/beta*/corr.ps_mass.json.gz
$ snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf

“Nothing to be done”. Some intermediate output is missing, but Snakemake already has the file you are telling it to make, so it doesn’t worry.

BASH

touch raw_data/beta*/out_pg
snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf

The touch command is a standard Unix command that resets the timestamp of the file, so now the configuration generation logs look to Snakemake as if they were just modified.

Snakemake sees that some of the input files used in the process of producing assets/plots/spectrum.pdf are newer than the existing output file, so it needs to run the avg_plaquette and one_loop_matching steps again. Of course, the one_loop_matching step needs the pseudoscalar mass data that we deleted earlier, so now the correlation function fitting step is re-run also.

Explicitly telling Snakemake what to re-run


The default timestamp-based logic is really useful when you want to:

  1. Change or add some inputs to an existing analysis without re-processing everything
  2. Continue running a workflow that failed part-way

In most cases you can also rely on Snakemake to detect when you have edited a rule, but sometimes you need to be explicit, for example if you have updated an external script or changed a setting that Snakemake doesn’t see.

The --forcerun flag allows you to explicitly tell Snakemake that a rule has changed and that all outputs from that rule need to be re-evaluated.

snakemake --forcerun spectrum --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf

Note on --forcerun

Due to a quirk of the way Snakemake parses command-line options, you need to make sure there are options after the --forcerun ..., before the list of target outputs. If you don’t do this, Snakemake will think that the target files are instead items to add to the --forcerun list, and then when building the DAG it will just try to run the default rule.

The easiest way is to put the --jobs flag before the target outputs. Then you can list multiple rules to re-run, and also multiple targets, and Snakemake can tell which is which.

BASH

snakemake --forcerun avg_plaquette ps_mass --jobs 1 --printshellcmds --use-conda intermediary_data/beta2.0/pg.corr.ps_decay_const.json.gz intermediary_data/beta2.5/pg.corr.ps_decay_const.json.gz

The reason for using the --jobs flag specifically is that you pretty much always want this option.

The --force flag specifies that the target outputs named on the command line should always be regenerated, so you can use this to explicitly re-make specific files.

BASH

$ snakemake --jobs 1 --force --printshellcmds --use-conda assets/plots/spectrum.pdf

This always re-runs spectrum, regardless of whether the output file is there already. For all intermediate outputs, Snakemake applies the default timestamp-based logic. Contrast with --forceall, which runs the entire DAG every time.

Visualising the DAG


Snakemake can draw a picture of the DAG for you, if you run it like this:

$ snakemake --force --dag assets/plots/spectrum.pdf | gm display -

Using the --dag option implicitly activates the --dry-run option, so Snakemake will not actually run any jobs; it just prints the DAG and stops. The DAG is printed in a text format (Graphviz DOT), so we use the gm command to turn it into a picture and show it on the screen.

Note on gm display

The gm command is provided by the GraphicsMagick toolkit. On systems where gm will not display an image directly, you can instead save it to a PNG file. You will need the dot program from the GraphViz package installed.

$ snakemake --force --dag assets/plots/spectrum.pdf | dot -Tpng > dag.png

![TODO][fig-dag2]

The boxes drawn with dotted lines indicate steps that are not to be run, as the output files are already present and newer than the input files.
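For larger workflows the per-job DAG can become unreadable. A more compact alternative is the --rulegraph option, which draws one box per rule rather than one per job; a minimal sketch, using the same target as above:

BASH

$ snakemake --force --rulegraph assets/plots/spectrum.pdf | dot -Tpng > rulegraph.png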

Visualising the effect of the --forcerun and --force flags

Run one_loop_matching on the beta2.0 ensemble, and then use the --dag option as shown above to check:

  1. How many jobs will run if you ask again to create this output with no --force, --forcerun or --forceall options?

  2. How many if you use the --force option?

  3. How many if you use the --forcerun ps_mass option?

  4. How many if you edit the metadata file so that the ps_plateau_start for the \(\beta=2.0\) ensemble is TODO, rather than TODO?

This is a way to make the result in the first place:

BASH

$ snakemake --jobs 1 --printshellcmds --use-conda intermediary_data/beta2.0/pg.corr.ps_decay_const.json

The DAG for this target can then be inspected with:

BASH

$ snakemake --dag intermediary_data/beta2.0/pg.corr.ps_decay_const.json | gm display -

  1. This command should show three boxes, but all are dotted, so no jobs are actually to be run.

  2. The --force flag re-runs only the job that creates the requested output file, so in this case one box is solid, and only that job will run.

  3. With --forcerun ps_mass, the ps_mass job will re-run, and Snakemake sees that this also requires re-running one_loop_matching, so the answer is 2.

    If you see a message like the one below, it’s because you need to put an option after ps_mass or else Snakemake gets confused about what are parameters of --forcerun, and what things are targets.

    ERROR

    WorkflowError:
    Target rules may not contain wildcards.
  4. Editing the metadata file changes the parameters seen by the ps_mass rule, which has the same effect as forcing that rule to re-run, so again there will be two jobs to be run from the DAG.

With older versions of Snakemake this change would not be auto-detected, and you can still see that behaviour if you remove the hidden .snakemake directory: Snakemake then has no memory of the rule change, so it will not re-run any jobs unless explicitly told to.

Removing files to trigger reprocessing

In general, getting Snakemake to re-run things by removing files is a bad idea, because it’s easy to forget about intermediate files that actually contain stale results and need to be updated. Using the --forceall flag is simpler and more reliable. If in doubt, and if it will not be too time consuming, keep it simple and just use --forceall to run the whole workflow from scratch.

For the opposite case where you want to avoid re-running particular steps, see the --touch option of Snakemake mentioned later in the lesson.
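As a preview, a minimal sketch of that usage looks like this; it updates the timestamps of existing output files on the way to the named target so that they are considered up to date, without running any jobs:

BASH

$ snakemake --jobs 1 --touch assets/plots/spectrum.pdf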

[fig-dag2]: fig/dag_2.png {alt='TODO: DAG for the partial workflow, with dotted boxes for the jobs that are already up to date and a solid box for the job that will run.'}

Key Points

  • A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
  • Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
  • If output files already exist, Snakemake can skip parts of the DAG
  • Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating

Content from Optimising workflow performance


Last updated on 2025-05-20 | Edit this page

Estimated time: 40 minutes

Overview

Questions

  • What compute resources are available on my system?
  • How do I define jobs with more than one thread?
  • How do I measure the compute resources being used by a workflow?
  • How do I run my workflow steps in parallel?

Objectives

  • Understand CPU, RAM and I/O bottlenecks
  • Understand the threads declaration
  • Use common Unix tools to look at resource usage

Processes, threads and processors


Some definitions:

  • Process: A running program (in our case, each Snakemake job can be considered one process)
  • Threads: Each process has one or more threads which run in parallel
  • Processor: Your computer has multiple CPU cores or processors, each of which can run one thread at a time

These definitions are a little simplified, but fine for our needs. The operating system kernel shares out threads among processors:

  • Having fewer threads than processors means you are not fully using all your CPU cores
  • Having more threads than processors means threads have to “timeslice” on a core which is generally suboptimal

If you tell Snakemake how many threads each rule will use, and how many cores you have available, it will start jobs in parallel to use all your cores. In the diagram below, five jobs are ready to run and there are four system cores.
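For example, to let Snakemake schedule jobs across up to four cores at once, a sketch using the same target as before would be:

BASH

$ snakemake --jobs 4 --use-conda assets/plots/spectrum.pdf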

Listing the resources of your machine


So, to know how many threads to make available to Snakemake, we need to know how many CPU cores our machine has. On Linux, we can find this out with the lscpu command.

$ lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                ARM
  Model name:             Cortex-A72
    Model:                3
    Thread(s) per core:   1

There we can see that we have four CPU cores, each of which can run a single thread.
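A quicker alternative on Linux is the nproc command, which simply prints the number of processing units available; on the machine above it gives:

BASH

$ nproc
4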

On macOS meanwhile, we use the command sysctl hw.ncpu:

$ sysctl hw.ncpu
hw.ncpu: 8

In this case, we see that this Mac has eight cores.

Likewise, on Linux we can find out the amount of RAM available with the free command:

BASH

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       1.1Gi       110Mi        97Mi       2.6Gi       2.6Gi
Swap:          199Mi       199Mi        60Ki

In this case, the machine has 3.7GiB of total RAM.

On macOS, the command is sysctl hw.memsize:

$ sysctl hw.memsize
hw.memsize: 34359738368

This machine has 34,359,738,368 bytes of RAM in total. Dividing by the number of bytes in 1GiB (\(1024^3\) bytes), that becomes 32GiB RAM.
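You can do this conversion directly in the shell, for example with Bash integer arithmetic:

BASH

$ echo $(( 34359738368 / 1024**3 ))
32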

We don’t want to use all of this RAM, but if we don’t mind other applications being unresponsive while our workflow runs, we can use the majority of it.

Finally, to check the available disk space, on the current partition:

BASH

$ df -h .

(or df -h without the . to show all partitions) This is the same on both macOS and Linux.

Parallel jobs in Snakemake


You may want to see the relevant part of the Snakemake documentation.

We’ll force all the intermediary steps to re-run by using the --forceall flag to Snakemake and time the whole run using the time command.

BASH

$ time snakemake --jobs 1 --forceall --use-conda assets/plots/spectrum.pdf
...
TODO TIME

Measuring how concurrency affects execution time

What is the wallclock time reported by the above command? We’ll work out the average for everyone present, or if you are working through the material on your own, repeat the measurement three times to get your own average.

Now change the Snakemake concurrency option to --jobs 2 and then --jobs 4.

  • How does the total execution time change?
  • What factors do you think limit the power of this setting to reduce the execution time?

The time will vary depending on the system configuration, but somewhere around TODO seconds is expected. This should reduce to around TODO seconds with --jobs 2, but higher --jobs values will produce diminishing returns.

Things that may limit the effectiveness of parallel execution include:

  • The number of processors in the machine
  • The number of jobs in the DAG which are independent and can therefore be run in parallel
  • The existence of single long-running jobs
  • The amount of RAM in the machine
  • The speed at which data can be read from and written to disk

There are a few gotchas to bear in mind when using parallel execution:

  1. Parallel jobs will use more RAM. If you run out then either your OS will swap data to disk, or a process will crash.
  2. Parallel jobs may trip over each other if they try to write to the same filename at the same time (this can happen with temporary files).
  3. The on-screen output from parallel jobs will be jumbled, so save any output to log files instead.

Multi-thread rules in Snakemake


In the diagram at the top, we showed jobs with 2 and 8 threads. These are defined by adding a threads: block to the rule definition. We could do this for the ps_mass rule:

rule ps_mass:
    input:
        data="raw_data/beta{beta}/out_corr",
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot="intermediary_data/beta{beta}/corr.ps_eff_mass.{plot_filetype}",
    log:
        messages="intermediary_data/beta{beta}/corr.ps_mass.log",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    threads: 4
    shell:
        "python -m su2pg_analysis.meson_mass {input.data} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"

You should explicitly use threads: 4 rather than something like params: threads="4", because Snakemake considers the threads value when scheduling jobs. Also, if the number of threads requested by a rule is more than the number of cores you make available, Snakemake will scale the job down to use the lower number.

Note that the threads value is used for scheduling; Snakemake does not automatically limit how many threads your program actually uses. To pass the number on to your program, use the {threads} placeholder in the shell command, for example via a command-line option or by setting an environment variable such as OMP_NUM_THREADS.
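A minimal sketch of both approaches (the rule name, file names and program name here are placeholders, not part of our workflow):

rule threaded_example:
    # hypothetical rule: input, output and program name are placeholders
    input: "some_input.dat"
    output: "some_output.dat"
    threads: 4
    # pass the thread count to the program explicitly, and also via OMP_NUM_THREADS
    shell:
        "OMP_NUM_THREADS={threads} some_program --threads {threads} {input} > {output}"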

Fine-grained profiling

Rather than timing the entire workflow, we can ask Snakemake to benchmark an individual rule.

For example, to benchmark the ps_mass step we could add this to the rule definition:

rule ps_mass:
    benchmark:
        "benchmarks/ps_mass.beta{beta}.txt"
    ...

The dataset here is so small that the numbers are tiny, but for real data this can be very useful as it shows time, memory usage and IO load for all jobs.
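The benchmark file is a small tab-separated table that you can inspect directly once the job has run; the exact columns depend on your Snakemake version, but typically include the wallclock time, maximum memory usage and I/O counts. For the \(\beta=2.0\) ensemble, for example:

BASH

$ column -t benchmarks/ps_mass.beta2.0.txt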

Running jobs on a cluster


Learning about clusters is beyond the scope of this course, but can be essential for more complex workflows working with large amounts of data.

When working with Snakemake, there are two options to getting the workflow running on a cluster:

  1. Similarly to most tools, we may install Snakemake on the cluster, write a job script, and execute Snakemake on our workflow inside a job.

  2. We can teach Snakemake how to run jobs on the cluster, and run our workflow from our own computer, having Snakemake do the work of submitting and monitoring the jobs for us.

To run Snakemake in the second way, someone will need to determine the right parameters for your particular cluster and save them as a profile. Once this is working, you can share the profile with other users on the cluster, so discuss this with your cluster sysadmin.

Instructions for configuring the Slurm executor plugin can be found in the Snakemake plugin catalog, along with the drmaa, cluster-generic and cluster-sync plugins which can support PBS, SGE and other cluster schedulers.
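As a sketch of the second approach on a Slurm cluster, assuming the snakemake-executor-plugin-slurm is installed and using placeholder account and partition names, the invocation might look something like:

BASH

$ snakemake --executor slurm --jobs 10 --use-conda \
    --default-resources slurm_account=myproject slurm_partition=compute \
    assets/plots/spectrum.pdf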

![][fig-cluster]

Running workflows on HPC or Cloud systems could be a whole course in itself. The topic is too important not to be mentioned here, but also complex to teach because you need a cluster to work on.

If you are teaching this lesson and have institutional HPC then ideally you should liaise with the administrators of the system to make a suitable installation of a recent Snakemake version and a profile to run jobs on the cluster job scheduler. In practice this may be easier said than done!

If you are able to demonstrate Snakemake running on cloud as part of a workshop then we’d much appreciate any feedback on how you did this and how it went.

Cluster demo

At this point in the course there may be a cluster demo…

[fig-cluster]: fig/cluster.jpg {alt='A photo of high-performance computing hardware racked in five cabinets in a server room, with the cabinet doors open to show the systems and cabling inside.'}

Key Points

  • To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
  • You also need to consider RAM, disk, and network bottlenecks
  • Profile your jobs to see what is taking most resources
  • Snakemake is great for running workflows on compute clusters