Content from Running commands with Snakemake


Last updated on 2025-07-28

Estimated time: 15 minutes

Overview

Questions

  • How do I run a simple command with Snakemake?

Objectives

  • Create a Snakemake recipe (a Snakefile)
  • Use Snakemake to count the trajectories generated for an ensemble

Introduction


Data analysis in lattice quantum field theory generally has many moving parts: you will likely have many ensembles, with differing physical and algorithmic parameters, and for each many different observables may be computed. These need to be combined in different ways, making sure that compatible statistics are used. Making sure that each step runs in the correct order is non-trivial, requiring careful bookkeeping, especially if we want to update data as ensembles are extended, if we want to take advantage of parallelism to get results faster, and if we want auditability to be able to verify later what steps were performed.

While we could build up tools to do all of these things from scratch, these are not challenges exclusive to lattice, and so we can take advantage of others’ work rather than reinventing the wheel. This frees up our time to focus on the physics challenges. The category of “tools to help run complex arrangements of tools in the right order” is called “workflow management”; there are workflow managers available, most of which are specialised to a specific class of applications.

One workflow manager developed for scientific data analysis is called Snakemake; this will be the target of this lesson. Snakemake is similar to GNU Make, in that you create a text file containing rules specifying how input files are translated to output files, and then the software will work out what rules to run to generate a specified output from the available input files. Unlike Make, Snakemake uses a syntax closely based on Python, and files containing rules can be extended using standard Python syntax. It also has many quality-of-life improvements compared to Make, and so is much better suited for writing data analysis workflows.

At this point, you should have Snakemake already installed and available to you. To test this, we can open a terminal and run

$ snakemake --version
8.25.3

If you instead get a “command not found” error, go back to the setup and check that you have completed all the necessary steps.

Looking at the sample data


You should already have the sample data files unpacked. (If not, refer back to the lesson setup.) Under the su2pg/raw_data directory, you will find a series of subdirectories, each containing data for a single ensemble. In each are files containing the log of the configuration generation, the computation of the quenched meson spectrum, and the computation of the Wilson flow.

The sample data are for the SU(2) pure Yang-Mills theory, and have been generated using the HiRep code.

Each log contains header lines describing the setup, information on the computation being performed, and results for observables computed on each configuration. Code to parse these logs and compute statistics is included with the sample data; we’ll use it in due course.

Making a Snakefile


To start with, let’s define a rule to count the number of lines in one of the raw data files.

Within the su2pg/workflow directory, edit a new text file named Snakefile. Into it, insert the following content:

rule count_lines:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "wc -l raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"

Key points about this file

  1. The file is named Snakefile - with a capital S and no file extension.
  2. Some lines are indented. Indents must be with space characters, not tabs.
  3. The rule definition starts with the keyword rule followed by the rule name, then a colon.
  4. We named the rule count_lines. You may use letters, numbers or underscores, but the rule name must begin with a letter and may not be a keyword.
  5. The keywords input, output, shell are all followed by a colon.
  6. The file names and the shell command are all in "quotes".

The first line tells Snakemake we are defining a new rule. Subsequent indented lines form a part of this rule; while there are none here, any subsequent unindented lines would not be included in the rule. The input: line tells Snakemake what files to look for to be able to run this rule. If this file is missing (and there is no rule to create it), Snakemake will not consider running this rule. The output: line tells Snakemake what files to expect the rule to generate. If this file is not generated, then Snakemake will abort the workflow with an error. Finally, the shell: block tells Snakemake what shell commands to run to get the specified output from the given input.

Going back to the shell now, we can test this rule. From the su2pg directory, we can run the command

snakemake --jobs 1 --forceall --printshellcmds intermediary_data/beta2.0/pg.count

If we’ve made any transcription errors in the rule (missing quotes, bad indentations, etc.), then it will become clear at this point, as we’ll receive an error that we will need to fix.

For now, we will consistently run snakemake with the --jobs 1 --forceall --printshellcmds options. As we move through the lesson, we’ll explain in more detail when we need to modify them.

Let’s check that the output was correctly generated:

$ cat intermediary_data/beta2.0/pg.count
  31064 raw_data/beta2.0/out_pg
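
The rule structure is the same for any command that reads one file and writes another. For example, a rule to count the lines in the Wilson flow log for the same ensemble could look like the sketch below (this assumes the flow log is named out_wflow, as it is in the sample data; we won’t need this rule, so there is no need to add it to your Snakefile):

rule count_wflow_lines:
    input: "raw_data/beta2.0/out_wflow"
    output: "intermediary_data/beta2.0/wflow.count"
    shell:
        "wc -l raw_data/beta2.0/out_wflow > intermediary_data/beta2.0/wflow.count"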

Callout

You might have noticed that we are grouping files into directories like raw_data and intermediary_data. It is generally a good idea to keep raw input data separate from data generated by the analysis. This means that if you need to run a clean analysis starting from your input data, then it is much easier to know what to remove and what to keep. Ideally, the raw_data directory should be kept read-only, so that you don’t accidentally modify your input data. Similarly, it is a good idea to separate out “files that you want to include in a paper” from “intermediary files generated by the workflow but not needed in the paper”; we’ll talk more about that in a later section.

You might also worry that your tooling will need to use mkdir to create these directories; in fact, Snakemake will automatically create all directories where it expects to see output from rules that it runs.

In the first few episodes we always run Snakemake with the --forceall flag; we won’t explain exactly what this flag does until episode 4. The rationale is that the default Snakemake behaviour when pruning the DAG leads to learners seeing different output (typically the message “nothing to be done”) when repeating the exact same command. This can seem strange to learners who are used to scripting and imperative programming.

The internal rules used by Snakemake to determine which jobs in the DAG are to be run, and which skipped, are pretty complex, but the behaviour seen under --forceall is much simpler and more consistent; Snakemake simply runs every job in the DAG every time. You can think of --forceall as disabling the lazy evaluation feature of Snakemake, until we are ready to properly introduce and understand it.

Running Snakemake

Run snakemake --help | less to see the help for all available options. What does the --printshellcmds option in the snakemake command above do?

  1. Protects existing output files
  2. Prints the shell commands that are being run to the terminal
  3. Tells Snakemake to only run one process at a time
  4. Prompts the user for the correct input file

Hint: you can search in the text by pressing /, and quit back to the shell with q

  2. Prints the shell commands that are being run to the terminal

This is such a useful thing we don’t know why it isn’t the default! The --jobs 1 option is what tells Snakemake to only run one process at a time, and we’ll stick with this for now as it makes things simpler. The --forceall option tells Snakemake to always recreate output files, and we’ll learn about protected outputs much later in the course. Answer 4 is a total red herring, as Snakemake never prompts interactively for user input.

Counting trajectories


The count of output lines isn’t particularly useful. Potentially more interesting is the number of trajectories in a given file. In a HiRep generation log, each trajectory concludes with a line of the form

OUTPUT

[MAIN][0]Trajectory #1: generated in [39.717707 sec]

We can use grep to count these, as

BASH

grep -c generated raw_data/beta2.0/out_pg

Counting sequences in Snakemake

Modify the Snakefile to count the number of trajectories in raw_data/beta2.0/out_pg, rather than the number of lines.

  • Rename the rule to count_trajectories
  • Keep the output file name the same
  • Remember that the result needs to go into the output file, not just be printed on the screen
  • Test the new rule once it is done.
rule count_trajectories:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "grep -c generated raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"

Key Points

  • Before running Snakemake you need to write a Snakefile
  • A Snakefile is a text file which defines a list of rules
  • Rules have inputs, outputs, and shell commands to be run
  • You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule

Content from Running Python code with Snakemake


Last updated on 2025-07-28

Estimated time: 15 minutes

Overview

Questions

  • How do I configure an environment to run Python with Snakemake?

Objectives

  • Create a Conda environment definition
  • Use Snakemake to instantiate this environment and use it to run Python code

Why define an environment


Snakemake is written in Python, so anyone running a Snakemake workflow already has Python available. In principle, we could make use of this installation to run any Python code we need to run in our workflow. However, it’s more than likely we will need to make use of libraries that are not installed as part of the Snakemake installation. At that point, we would need either to install additional libraries into our Snakemake environment (risking conflicts with other analyses that share that environment but need different sets of libraries), or to create a second Snakemake environment for this analysis. If different steps of our workflow need different, conflicting sets of libraries then this becomes more complicated again.

We would also like those trying to reproduce our work to be able to run using exactly the same software environment that we used in our original work. In principle, we could write detailed documentation specifying which packages to install; however, it is both more precise and more convenient to define the environment as a data file, which Conda can use to build the same environment every time.

Even better, we can tell Snakemake to use a specific Conda environment for each rule we define.

A basic environment definition


Conda environment definitions are created in YAML-format files. These specify what Conda packages are needed (including the target version of Python), as well as any Pip packages that are installed.

Callout

Some packages give you a choice as to whether to install using Conda or Pip. When working interactively with an environment, using Pip consistently typically reduces the chance of getting into a state where Conda is unable to install packages. That is less of a problem when constructing new environments from definition files, but even so, using Pip where possible will typically allow environments to resolve and install more quickly.

By convention, Conda environment definitions in Snakemake workflows are placed in a workflow/envs/ directory. Let’s create workflow/envs/analysis.yml now, and place the following content into it

YAML

name: su2pg_analysis
channels:
  - conda-forge
dependencies:
  - pip=24.2
  - python=3.12.6
  - pip:
      - h5py==3.11.0
      - matplotlib==3.9.2
      - numpy==2.1.1
      - pandas==2.2.3
      - scipy==1.14.1
      - uncertainties==3.2.2
      - corrfitter==8.2
      - -e ../../libs/su2pg_analysis

This will install the specified versions of h5py, Matplotlib, Numpy, Pandas, Scipy, uncertainties, and corrfitter, as well as the analysis tools provided in the libs directory. The latter are installed in editable mode, so if you need to modify them to fix bugs or add functionality while working on your workflow, you don’t need to remember to manually reinstall them.

Using an environment definition in a Snakefile


Now that we have created an environment file, we can use it in our Snakefile to compute the average plaquette from a configuration generation log. Let’s add the following rule to workflow/Snakefile:

rule avg_plaquette:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.plaquette.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette raw_data/beta2.0/out_pg --output_file intermediary_data/beta2.0/pg.plaquette.json.gz"

The conda: block tells Snakemake where to find the Conda environment that should be used for running this rule. Let’s test this now:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/pg.plaquette.json.gz

We need to specify --use-conda to tell Snakemake to pay attention to the conda: specification.
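
If you would like Snakemake to create the Conda environment ahead of time, without running any jobs (useful because building the environment can be slow), you can use the --conda-create-envs-only flag, for example:

snakemake --jobs 1 --use-conda --conda-create-envs-only intermediary_data/beta2.0/pg.plaquette.json.gz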

Let’s check now that the output was correctly generated:

$ cat intermediary_data/beta2.0/pg.plaquette.json.gz | gunzip | head -n 20
{
 "program": "pyerrors 2.13.0",
 "version": "1.1",
 "who": "ed",
 "date": "2025-05-20 15:44:09 +0100",
 "host": "tsukasa.lan, macOS-15.4-arm64-arm-64bit",
 "description": {
  "group_family": "SU",
  "num_colors": 2,
  "nt": 48,
  "nx": 24,
  "ny": 24,
  "nz": 24,
  "beta": 2.0,
  "num_heatbath": 1,
  "num_overrelaxed": 4,
  "num_thermalization": 1000,
  "thermalization_time": 2453.811479,
  "num_trajectories": 10010
 },

Some of the output will differ on your machine, since this library tracks provenance, such as where and when the code was run, in the output file.

More plaquettes

Add a second rule to compute the average plaquette from the file raw_data/beta2.2/out_pg. Add this to the same Snakefile you already made, under the avg_plaquette rule, and run your rules in the terminal. When running the snakemake command you’ll need to tell Snakemake to make both the output files.

You can choose whatever name you like for this second rule, but it can’t be avg_plaquette, as rule names need to be unique within a Snakefile. So in this example answer we use avg_plaquette2.

rule avg_plaquette2:
    input: "raw_data/beta2.2/out_pg"
    output: "intermediary_data/beta2.2/pg.plaquette.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette raw_data/beta2.2/out_pg --output_file intermediary_data/beta2.2/pg.plaquette.json.gz"

Then in the shell:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/pg.plaquette.json.gz intermediary_data/beta2.2/pg.plaquette.json.gz

If you think writing a separate rule for each output file is silly, you are correct. We’ll address this next!

Content from Placeholders and wildcards


Last updated on 2025-07-28

Estimated time: 15 minutes

Overview

Questions

  • How do I make a generic rule?
  • How does Snakemake decide what rule to run?

Objectives

  • Use Snakemake to compute the plaquette in any file
  • Understand the basic steps Snakemake goes through when running a workflow
  • See how Snakemake deals with some errors

Making rules more generic


In the previous two episodes, we wrote rules to count the number of generated trajectories in, and compute the average plaquette of, one ensemble. As a reminder, this was one such rule:

rule count_trajectories:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "grep -c generated raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"

When we needed to do the same for a second ensemble, we made a second copy of the rule, and changed the input and output filenames. This is obviously not scalable to large analyses: instead, we would like to write one rule for each type of operation we are interested in. To do this, we’ll need to use placeholders and wildcards. Such a generic rule might look as follows:

# Count number of generated trajectories for any ensemble
rule count_trajectories:
    input: "raw_data/{subdir}/out_pg"
    output: "intermediary_data/{subdir}/pg.count"
    shell:
        "grep -c generated {input} > {output}"

Comments in Snakefiles

In the above code, the line beginning # is a comment line. Hopefully you are already in the habit of adding comments to your own software. Good comments make any code more readable, and this is just as true with Snakefiles.

{subdir} here is an example of a wildcard. Wildcards are used in the input and output lines of the rule to represent parts of filenames. Much like the * pattern in the shell, the wildcard can stand in for any text in order to make up the desired filename. As with naming your rules, you may choose any name you like for your wildcards; so here we used subdir, since it is describing a subdirectory. If subdir is set to beta2.0 then the new generic rule will have the same inputs and outputs as the original rule. Using the same wildcards in the input and output is what tells Snakemake how to match input files to output files.

If two rules use a wildcard with the same name then Snakemake will treat them as different entities—rules in Snakemake are self-contained in this way.

Meanwhile, {input} and {output} are placeholders. Placeholders are used in the shell section of a rule. Snakemake will replace them with appropriate values before running the command: {input} with the full name of the input file, and {output} with the full name of the output file.

If we had wanted to include the value of the subdir wildcard directly in the shell command, we could have used the placeholder {wildcards.subdir}, but in most cases, as here, we just need the {input} and {output} placeholders.
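
For example, if we wanted the shell command to also announce which ensemble it is working on, a sketch using the wildcard directly (the echo is added purely for illustration) could be:

rule count_trajectories:
    input: "raw_data/{subdir}/out_pg"
    output: "intermediary_data/{subdir}/pg.count"
    shell:
        "echo 'Counting trajectories for {wildcards.subdir}' && grep -c generated {input} > {output}"

We don’t need the extra message, so keep the simpler version above in your Snakefile.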

Let’s test this general rule now:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/pg.count

As previously, if you see errors at this point, there is likely a problem with your Snakefile; check that all the rules match the ones that have appeared here, and that there aren’t multiple rules with the same name.

General plaquette computation

Modify your Snakefile so that it can compute the average plaquette for any ensemble, not just the ones we wrote specific rules for in the previous episode.

Test this with some of the values of \(\beta\) present in the raw data.

The replacement rule should look like:

# Compute average plaquette for any ensemble from its generation log
rule avg_plaquette:
    input: "raw_data/{subdir}/out_pg"
    output: "intermediary_data/{subdir}/pg.plaquette.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette {input} --output_file {output}"

To test this, for example:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta1.8/pg.plaquette.json.gz

Choosing the right wildcards

Our rule puts the trajectory counts into output files named like pg.count. How would you have to change the count_trajectories rule definition if you wanted:

  1. the output file for raw_data/beta1.8/out_hmc to be intermediary_data/beta1.8/hmc.count?

  2. the output file for raw_data/beta1.8/mass_fun-0.63/out_hmc to be intermediary_data/beta1.8/mass_fun-0.63/hmc.count?

  3. the output file for raw_data/beta1.8/mass_fun-0.63/out_hmc to be intermediary_data/hmc_b1.8_m-0.63.count (for raw_data/beta1.9/mass_fun-0.68/out_hmc to be intermediary_data/hmc_b1.9_m-0.68.count, etc.)?

  4. the output file for raw_data/beta1.8/mass_fun-0.63/out_hmc to be intermediary_data/hmc_m-0.63.count (for raw_data/beta1.9/mass_fun-0.68/out_hmc to be intermediary_data/hmc_m-0.68.count, etc.)?

(Assume that both pure-gauge and HMC logs tag generated trajectories the same way. Note that input files for the latter data are not included in the sample data, so these will not work as-is.)

In none of these cases do you need to change the shell part of the rule at all.

input: "raw_data/{subdir}/out_hmc"
output: "intermediary_data/{subdir}/hmc.count"

This can be done by changing only the static parts of the input: and output: lines.

This in fact requires no change from the previous answer. The wildcard {subdir} can include /, so can represent multiple levels of subdirectory.

input: "raw_data/beta{beta}/mass_fun{mass}/out_hmc"
output: "intermediary_data/hmc_b{beta}_m{mass}.count"

In this case, it was necessary to change the wildcards, because the subdirectory name needs to be split to obtain the values of \(\beta\) and \(m_{\mathrm{fun}}\). The names chosen here are {beta} and {mass}, but you could choose any names, as long as they match between the input and output parts.

This one isn’t possible, because Snakemake cannot determine which input file you want to count by matching wildcards on the file name intermediary_data/hmc_m-0.63.count. You could try a rule like this:

input: "raw_data/beta1.8/mass_fun{mass}/out_hmc"
output: "intermediary_data/hmc_m{mass}.count"

…but it only works because \(\beta\) is hard-coded into the input line, and the rule will only work on this specific sample, not other cases where other values of \(\beta\) may be wanted. In general, input and output filenames need to be carefully chosen so that Snakemake can match everything up and determine the right input from the output filename.

Filenames aren’t data

Notice that in some examples we can pull out the value of \(\beta\) from the name of the directory in which the file is located. However, ideally, we should avoid relying on this being correct. The name and location are useful for us to find the correct file, but we should try to ensure that the file contents also contain these data, and that we make use of those data in preference to the filename.

Snakemake order of operations


We’re only just getting started with some simple rules, but it’s worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases:

  1. Prepares to run:
    1. Reads in all the rule definitions from the Snakefile
  2. Plans what to do:
    1. Sees what file(s) you are asking it to make
    2. Looks for a matching rule by looking at the outputs of all the rules it knows
    3. Fills in the wildcards to work out the input for this rule
    4. Checks that this input file is actually available
  3. Runs the steps:
    1. Creates the directory for the output file, if needed
    2. Removes the old output file if it is already there
    3. Only then, runs the shell command with the placeholders replaced
    4. Checks that the command ran without errors and made the new output file as expected

For example, if we now ask Snakemake to generate a file named intermediary_data/wibble_1/pg.count:

OUTPUT

$ snakemake --jobs 1 --forceall --printshellcmds intermediary_data/wibble_1/pg.count
Building DAG of jobs...
MissingInputException in line 1 of /home/zenmaster/data/su2pg/workflow/Snakefile:
Missing input files for rule count_trajectories:
    output: intermediary_data/wibble_1/pg.count
    wildcards: subdir=wibble_1
    affected files:
        raw_data/wibble_1/out_pg

Snakemake sees that a file with a name like this could be produced by the count_trajectories rule. However, when it performs the wildcard substitution it sees that the input file would need to be named raw_data/wibble_1/out_pg, and there is no such file available. Therefore Snakemake stops and gives an error before any shell commands are run.

Dry-run (--dry-run) mode

It’s often useful to run just the first two phases, so that Snakemake will plan out the jobs to run, and print them to the screen, but never actually run them. This is done with the --dry-run flag, eg:

BASH

$ snakemake --dry-run --forceall --printshellcmds intermediary_data/beta2.0/pg.count

We’ll make use of this later in the lesson.

The amount of checking may seem pedantic right now, but as the workflow gains more steps this will become very useful to us indeed.

Key Points

  • Snakemake rules are made generic with placeholders and wildcards
  • Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
  • Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
  • Snakemake checks for various error conditions and will stop if it sees a problem

Content from Chaining rules


Last updated on 2025-07-28

Estimated time: 50 minutes

Overview

Questions

  • How do I combine rules into a workflow?
  • How can I make a rule with multiple input files?
  • How should I refer to multiple files with similar names?

Objectives

  • Use Snakemake to combine the outputs of several rules into a single plot
  • Understand how rules are linked by filename patterns
  • Be able to use multiple input files in one rule

A pipeline of multiple rules


We have so far been able to count the number of generated trajectories, and compute the average plaquette, given an output log from the configuration generation. However, an individual average plaquette is not interesting in isolation; what is more interesting is how it varies between different values of the input parameters. To do this, we will need to take the output of the avg_plaquette rule that we defined earlier, and use it as input for another rule.

Let’s define that rule now:

# Take individual data files for average plaquette and plot combined results
rule plot_avg_plaquette:
    input:
        "intermediary_data/beta1.8/pg.plaquette.json.gz",
        "intermediary_data/beta2.0/pg.plaquette.json.gz",
        "intermediary_data/beta2.2/pg.plaquette.json.gz",
    output:
        "assets/plots/plaquette_scan.pdf"
    conda: "envs/analysis.yml"
    shell:
        "python src/plot_plaquette.py {input} --output_filename {output}"

Callout

You can see that here we’re putting “files that want to be included in a paper” in an assets directory, similarly to the raw_data and intermediary_data directories we discussed in a previous episode. It can be useful to further distinguish plots, tables, and other definitions, by using subdirectories in this directory.

Rather than one input, as we have seen in rules so far, this rule requires three. When Snakemake substitutes these into the {input} placeholder, it will automatically add a space between them. Let’s test this now:

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf
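
Because we used --printshellcmds, we can see the command after placeholder substitution; the three input files appear as a single space-separated list, something like:

python src/plot_plaquette.py intermediary_data/beta1.8/pg.plaquette.json.gz intermediary_data/beta2.0/pg.plaquette.json.gz intermediary_data/beta2.2/pg.plaquette.json.gz --output_filename assets/plots/plaquette_scan.pdf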

Look at the logging messages that Snakemake prints in the terminal. What has happened here?

  1. Snakemake looks for a rule to make assets/plots/plaquette_scan.pdf
  2. It determines that the plot_avg_plaquette rule can do this, if it has intermediary_data/beta1.8/pg.plaquette.json.gz, intermediary_data/beta2.0/pg.plaquette.json.gz, and intermediary_data/beta2.2/pg.plaquette.json.gz.
  3. Snakemake looks for a rule to make intermediary_data/beta1.8/pg.plaquette.json.gz
  4. It determines that avg_plaquette can make this if subdir=beta1.8
  5. It sees that the input needed is therefore raw_data/beta1.8/out_pg
  6. Now Snakemake has reached an available input file, it runs the avg_plaquette rule.
  7. It then looks through the other two \(\beta\) values in turn, repeating the process until it has all of the needed inputs.
  8. Finally, it runs the plot_avg_plaquette rule.

Here’s a visual representation of this process:

A visual representation of the above process showing the rule definitions, with arrows added to indicate the order wildcards and placeholders are substituted. Blue arrows start from the input of the `plot_avg_plaquette` rule, which are the files `intermediary_data/beta1.8/pg.plaquette.json.gz`, `intermediary_data/beta2.0/pg.plaquette.json.gz`, and `intermediary_data/beta2.2/pg.plaquette.json.gz`, then point down from components of the filename to wildcards in the output of the `avg_plaquette` rule. Orange arrows then track back up through the shell parts of both rules, where the placeholders are, and finally back to the target output filename at the top.

This, in a nutshell, is how we build workflows in Snakemake.

  1. Define rules for all the processing steps
  2. Choose input and output naming patterns that allow Snakemake to link the rules
  3. Tell Snakemake to generate the final output files

If you are used to writing regular scripts this takes a little getting used to. Rather than listing steps in order of execution, you are always working backwards from the final desired result. The order of operations is determined by applying the pattern matching rules to the filenames, not by the order of the rules in the Snakefile.

Choosing file name patterns

Chaining rules in Snakemake is a matter of choosing filename patterns that connect the rules. There’s something of an art to it, and most times there are several options that will work, but in all cases the file names you choose will need to be consistent and unambiguous.

Making file lists easier


In the rule above, we plotted the average plaquette for three values of \(\beta\) by listing the files expected to contain their values. In fact, we have data for a larger number of \(\beta\) values, but typing out each file by hand would be quite cumbersome. We can make use of the expand() function to do this more neatly:

    input:
        expand(
            "intermediary_data/beta{beta}/pg.plaquette.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        )

The first argument to expand() here is a template for the filename, and subsequent keyword arguments are lists of variables to fill into the placeholders. The output is the cartesian product of all the parameter lists.
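
To see the Cartesian product in action, here is a small sketch with two lists (the second list of channel names is hypothetical, purely to illustrate how the combinations are generated):

expand(
    "intermediary_data/beta{beta}/corr.{channel}_mass.json.gz",
    beta=[2.0, 2.2],
    channel=["ps", "v"],
)
# gives a list of these four filenames:
#   intermediary_data/beta2.0/corr.ps_mass.json.gz
#   intermediary_data/beta2.0/corr.v_mass.json.gz
#   intermediary_data/beta2.2/corr.ps_mass.json.gz
#   intermediary_data/beta2.2/corr.v_mass.json.gz

For our plaquette plot, a single beta list is all we need.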

We can check that this works correctly:

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf

Tabulating trajectory counts

The script src/tabulate_counts.py will take a list of files containing plaquette data, and output a LaTeX table of trajectory counts. Write a rule to generate this table for all values of \(\beta\), and output it to assets/tables/counts.tex.

The replacement rule should look like:

# Output a LaTeX table of trajectory counts
rule tabulate_counts:
    input:
        expand(
            "intermediary_data/beta{beta}/pg.count",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        )
    output: "assets/tables/counts.tex"
    conda: "envs/analysis.yml"
    shell:
        "python src/tabulate_counts.py {input} > {output}"

To test this, for example:

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/tables/counts.tex

Tabulating trajectory counts (continued)

This setup currently requires reading the value of \(\beta\) from the filename. Why is this not ideal? How would the workflow need to be changed to avoid this?

It’s easy for files to be misnamed when creating or copying them. Putting the wrong data into the file is harder, especially when it’s a raw data file generated by the same program as the rest of the data. (If the wrong value were given as input, this could happen, but the corresponding output data would also be generated at that incorrect value. Provided the values are treated consistently, the downstream analysis could in fact still be valid, just not exactly as intended.)

Currently, grep -c is used to count the number of trajectories. This would need to be replaced or supplemented with a tool that reads the value of \(\beta\) from the input log and outputs it along with the trajectory count. The src/tabulate_counts.py script could then be updated to use this number, rather than the filename.

In fact, the plaquette module does just this; in addition to the average plaquette, it also records the number of trajectories generated as part of the metadata and provenance information it tracks.
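
You can confirm this by looking for the relevant key in one of the files the workflow has already produced, for example:

$ gunzip -c intermediary_data/beta2.0/pg.plaquette.json.gz | grep num_trajectories
  "num_trajectories": 10010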

Key Points

  • Snakemake links up rules by iteratively looking for rules that make missing inputs
  • Careful choice of filenames allows this to work
  • Rules may have multiple named input files (and output files)
  • Use expand() to generate lists of filenames from a template

Content from Metadata and parameters


Last updated on 2025-07-28

Estimated time: 15 minutes

Global parameters


Thus far, each of our rules has taken one or more input files, and given output files solely based on that. However, in some cases we may want to control options without having them within an input file.

For example, in the previous episode, we wrote a rule to plot a graph using the script src/plot_plaquette.py. The style of output we got was good for a paper, but if we were producing a poster, or putting the plot onto a slide with a dark background, we may wish to use a different output style. The plot_plaquette.py script accepts a --plot_styles argument, to tell it what style file to use when plotting. One way to make use of this would be to add --plot_styles styles/paper.mplstyle directly to the shell: block. However, if we had many such rules, and wanted to switch from generating output for a paper to generating it for a poster, then we would need to change the value in many places. Instead, we can define a variable at the top of the Snakefile

plot_styles = "styles/paper.mplstyle"

Then, when we use a script to generate a plot, we can update the shell: block of the corresponding rule similarly to

"python src/plot_plaquette.py {input} --output_filename {output} --plot_styles {plot_styles}"

Snakemake will substitute the value of the global plot_styles variable in place of the {plot_styles} placeholder.
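
Putting these pieces together, the plotting rule would now look something like this (with plot_styles defined at the top of the Snakefile):

# Take individual data files for average plaquette and plot combined results
rule plot_avg_plaquette:
    input:
        expand(
            "intermediary_data/beta{beta}/pg.plaquette.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        )
    output:
        "assets/plots/plaquette_scan.pdf"
    conda: "envs/analysis.yml"
    shell:
        "python src/plot_plaquette.py {input} --output_filename {output} --plot_styles {plot_styles}"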

We can test this by changing paper to poster, and running

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf

We can see that the generated file now uses a different set of fonts.

Wilson flow

The tool su2pg_analysis.w0 computes the scale \(w_0\) given a log of the energy density during evolution of the Wilson flow for an ensemble. To do this, the reference scale \(\mathcal{W}_0\) needs to be passed to the --W0 parameter. Use this, and the logs stored in the files out_wflow in each ensemble’s raw data directory, to output the \(w_0\) scale in a file wflow.w0.json.gz for each ensemble, taking the reference value \(\mathcal{W}_0 = 0.2\).

W0_reference = 0.2

# Compute w0 scale for single ensemble for fixed reference scale
rule w0:
    input: "raw_data/{subdir}/out_wflow"
    output: "intermediary_data/{subdir}/wflow.w0.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.w0 {input} --W0 {W0_reference} --output_file {output}"

Generating different filetypes

In addition to different plot styles, we may also wish to generate different filetypes. PDF is useful for including in LaTeX, but SVG may be a better format to use with some tools.

If we add a global definition:

plot_filetype = "pdf"

and update the output: block of the rule as:

    output:
        "assets/plots/plaquette_scan.{plot_filetype}"

does this have the same effect as the example with --plot_styles above?

(Hint: what happens when you try to make the targets assets/plots/plaquette_scan.svg and assets/plots/plaquette_scan.txt by specifying them at the command line, without changing the value of plot_filetype?)

This can achieve a similar result, but in a slightly different way. In the --plot_styles example, the {plot_styles} string is in the shell: block, and so directly looks up the global plot_styles variable. (Recall that to use a wildcard’s value in the shell: block, we needed to explicitly use the wildcards. prefix.)

However, in this case the {plot_filetype} string is in the output: block, so defines a wildcard. This may take any value, so if we instruct snakemake to produce plaquette_scan.txt, it will diligently pass that filename to plot_plaquette.py.

The plot_filetype = "pdf" is in fact ignored. It could however be used to set a default set of targets to generate, which we will talk about in a later episode.

Metadata from a file


We would frequently like our rules to depend on data that are specific to the ensembles being analysed. For example, consider the rule:

# Compute pseudoscalar mass and amplitude with fixed plateau
rule ps_mass:
    input: "raw_data/{subdir}/out_corr"
    output: "intermediary_data/{subdir}/corr.ps_mass.json.gz"
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output} --plateau_start 18 --plateau_end 23"

This rule hardcodes the positions of the start and end of the plateau region. In most studies, each ensemble and observable may have a different plateau position, so there is no good value to hardcode this to. Instead, we’d like a way of picking the right value from some list of parameters that we specify.

We could do this within the Snakefile, but where possible it is good to avoid mixing data with code. We shouldn’t need to modify our code every time we add or modify the data it is analysing. Instead, we’d like to have a dedicated file containing these parameters, and to be able to have Snakemake read it and pick out the correct values.

To do this, we can exploit the fact that Snakemake is an extension of Python. In particular, we can use the Pandas library for tabular data to read in a CSV file. Let’s add the following to the top of the file:

import pandas

metadata = pandas.read_csv("metadata/ensemble_metadata.csv")

The file being read here is a CSV (Comma Separated Values) file. We can create, view, and modify this with the spreadsheet tool of our choice. Let’s take a look at the file now.

Screenshot of a spreadsheet application showing the file metadata/ensemble_metadata.csv.

You can see that we have columns defining metadata to identify each ensemble, and columns for parameters relating to the analysis of each ensemble.

Now, how do we tell Snakemake to pull out the correct value from this?

# Compute pseudoscalar mass and amplitude, read plateau from metadata
rule ps_mass:
    input: "raw_data/beta{beta}/out_corr"
    output: "intermediary_data/beta{beta}/corr.ps_mass.json.gz"
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end}"

We’ve done a couple of things here. Firstly, we’ve made explicit the reference to \(\beta\) in the file paths, so that we can use beta as a wildcard, similarly to in the challenge in the previous episode. Secondly, we’ve introduced a params: block. This is how we tell Snakemake about quantities that may vary from run to run, but that are not filenames. Thirdly, we’ve used the lookup() function to search the metadata dataframe for the ensemble that we are considering. Finally, we’ve used {params.plateau_start} and {params.plateau_end} placeholders to use these parameters in the shell command that gets run.
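
If you are curious what lookup() is doing, a rough pandas equivalent (shown only for intuition; you don’t need to add this to the Snakefile) for the beta2.0 ensemble would be:

# Select the rows matching the query, with the wildcard value filled in,
# then take the value from the requested column
row = metadata.query("beta == 2.0")
plateau_start = row["ps_plateau_start"].item()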

Let’s test this now:

snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/corr.ps_mass.json.gz
cat intermediary_data/beta2.0/corr.ps_mass.json.gz | gunzip | head -n 20

OUTPUT

TODO

TODO CHALLENGE

Content from Multiple inputs and outputs


Last updated on 2025-07-28

Estimated time: 15 minutes

Multiple outputs


Quite frequently, we will want a rule to be able to generate more than one file. It’s important we let Snakemake know about this, both so that it can instruct our tools on where to place these files, and so it can verify that they are correctly created by the rule. For example, when fitting a correlation function with a plateau region that we specify, it’s important to look at an effective mass plot to verify that the plateau actually matches what we assert. The rule we just wrote doesn’t do this—it only spits out a numerical answer. Let’s update this rule so that it can also generate the effective mass plot.

# Compute pseudoscalar mass and amplitude, read plateau from metadata,
# and plot effective mass
rule ps_mass:
    input: "raw_data/beta{beta}/out_corr"
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot="intermediary_data/beta{beta}/corr.ps_eff_mass.pdf",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles}"

Rather than having a single string after output:, we now have a block with two lines. Each line has the format name=value, and is followed by a comma. To make use of these variables in our rule, we follow output by a ., and then the name of the variable we want to use, similarly to what we do for wildcards and params.

Non-specificity

What happens if we define multiple named output: variables like this, but refer to the {output} placeholder in the shell: block without specifying a variable name?

(One way to find this out is to try echo {output} as the entire shell: content; this will generate a missing output error, but will first let you see what the output is.)

Snakemake will provide all of the defined output variables, as a space-separated list. This is similar to what happens when an output variable is a list, as we saw earlier when looking at the expand() function.
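
For example, with the ps_mass rule above and the beta2.0 ensemble, echo {output} would print something like:

OUTPUT

intermediary_data/beta2.0/corr.ps_mass.json.gz intermediary_data/beta2.0/corr.ps_eff_mass.pdf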

Flow plots

Update the Wilson flow \(w_0\) computation that we looked at in a previous challenge to also output the flow of \(\mathcal{W}(t)\), so that the shape of the flow may be checked.

rule w0:
    input: "raw_data/{subdir}/out_wflow"
    output:
        data="intermediary_data/{subdir}/wflow.w0.json.gz",
        plot="intermediary_data/{subdir}/wflow.W_flow.{plot_filetype}",
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.w0 {input} --W0 {W0_reference} --output_file {output} --plot_file {output.plot} --plot_styles {plot_styles}"

Multiple inputs


Similarly to outputs, there are many situations where we want to work with more than one class of input file, for example to combine differing observables into one result. The ps_mass rule we wrote previously also outputs the amplitude of the exponential. When combined with the average plaquette via one-loop matching, this can be used to give an estimate of the decay constant. The syntax for multiple inputs is the same as we saw above for output:.

rule one_loop_matching:
    input:
        plaquette="intermediary_data/{subdir}/pg.plaquette.json.gz",
        meson="intermediary_data/{subdir}/corr.{channel}_mass.json.gz",
    output:
        data="intermediary_data/{subdir}/pg.corr.{channel}_decay_const.json.gz",
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.one_loop_matching --plaquette {input.plaquette} --meson {input.meson} --output_file {output.data}"

Naming things

Even when there is only one output: file, we are still allowed to name it. This makes life easier if we need to add more outputs later, and can make it a little clearer what our intent is when we come to read the workflow later.

Spectrum plot

Write a rule that plots the pseudoscalar channel’s decay constant against its mass, for each ensemble studied. The tool src/plot_spectrum.py will help with this.

Try making the filename of the script an input too, so that if the script is changed, Snakemake will correctly re-run the workflow.

Since the plot combines data from every ensemble, the per-ensemble inputs need to be listed explicitly (here with expand()); wildcards in the input cannot be inferred from the single output filename.

rule spectrum:
    input:
        script="src/plot_spectrum.py",
        ps_mass=expand(
            "intermediary_data/beta{beta}/corr.ps_mass.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        ),
        ps_decay_const=expand(
            "intermediary_data/beta{beta}/pg.corr.ps_decay_const.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        ),
    output:
        plot="assets/plots/spectrum.{plot_filetype}"
    conda: "envs/analysis.yml"
    shell:
        "python {input.script} {input.ps_mass} {input.ps_decay_const} --output_file {output.plot} --plot_styles {plot_styles}"

Log files


When a process run by Snakemake exits with an error code, Snakemake removes the output files that the failed job was expected to produce. Usually this is what we want: we don’t want to keep potentially corrupt output that might be used as input for subsequent rules. However, there are some classes of output file that are useful in helping to identify what caused the error in the first place: log files.

We can tell Snakemake that specified files are log files, rather than regular output files, by placing them in a log: block rather than an output: one. Snakemake will not delete a file marked as a log if an error is raised by the process generating it.

For example, for the ps_mass rule above, we might use:

# Compute pseudoscalar mass and amplitude, read plateau from metadata,
# and plot effective mass
rule ps_mass:
    input:
        data="raw_data/beta{beta}/out_corr",
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot="intermediary_data/beta{beta}/corr.ps_eff_mass.{plot_filetype}",
    log:
        messages="intermediary_data/beta{beta}/corr.ps_mass.log",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input.data} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"

|& tee

You may recall that | is the pipe operator in the Unix shell, taking standard output from one program and passing it to standard input of the next. (If this is unfamiliar, you may wish to look through the Software Carpentry introduction to the Unix shell when you have a moment.)

Adding the & symbol means that both the standard output and standard error streams are piped, rather than only standard output. This is useful for a log file, since we will typically want to see errors there.

The tee command “splits a pipe”; that is, it takes standard input and outputs it both to standard output and to the specified filename. This way, we get the log on disk, but also output to screen as well, so we can monitor issues as the workflow runs.

Logged plots

Adjust the solution for plotting the spectrum above so that any warnings or errors generated by the plotting script are logged to a file.

rule spectrum:
    input:
        script="src/plot_spectrum.py",
        ps_mass=expand(
            "intermediary_data/beta{beta}/corr.ps_mass.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        ),
        ps_decay_const=expand(
            "intermediary_data/beta{beta}/pg.corr.ps_decay_const.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        ),
    output:
        plot="assets/plots/spectrum.{plot_filetype}"
    log:
        messages="intermediary_data/plot_spectrum.{plot_filetype}.log"
    conda: "envs/analysis.yml"
    shell:
        "python {input.script} {input.ps_mass} {input.ps_decay_const} --output_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"

Dealing with errors


We’ll end the chapter by looking at a common problem that can arise if you mistype a filename in a rule. It may seem silly to break the workflow when we just got it working, but it will be instructive, so let’s modify the Snakefile and deliberately specify an incorrect output filename in the ps_mass rule.

...
shell:
        "python -m su2pg_analysis.meson_mass {input.data} --output_file {output.data}.json_ --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"

To keep things tidy, this time we’ll manually remove the intermediary data directory.

BASH

$ rm -rvf intermediary_data

And re-run.

$ snakemake --jobs 1 --forceall --printshellcmds --use-conda intermediary_data/beta2.0/corr.ps_mass.json.gz

...
TODO

There’s a lot to take in here. Some of the messages are very informative. Some less so.

  1. Snakemake did actually run the tool, as evidenced by the output from the program that we see on the screen.
  2. Python is reporting that there is a file missing.
  3. Snakemake complains that one expected output file is missing: intermediary_data/beta2.0/corr.ps_mass.json.gz.
  4. The other expected output file intermediary_data/beta2.0/corr.ps_eff_mass.pdf was found but has now been removed by Snakemake.
  5. Snakemake suggests this might be due to “filesystem latency”.

This last point is a red herring. “Filesystem latency” is not an issue here, and never will be, since we are not using a network filesystem. We know what the problem is, as we deliberately caused it, but to diagnose an unexpected error like this we would investigate further by looking at the intermediary_data/beta2.0 subdirectory.

$ ls intermediary_data/beta2.0/
TODO

Remember that Snakemake itself does not create any output files. It just runs the commands you put in the shell sections, then checks to see if all the expected output files have appeared.

So if the file names created by your rule are not exactly the same as in the output: block you will get this error, and you will, in this case, find that some output files are present but others (corr.ps_eff_mass.pdf, which was named correctly) have been cleaned up by Snakemake.

Errors are normal

Don’t be disheartened if you see errors like the one above when first testing your new Snakemake workflows. There is a lot that can go wrong when writing a new workflow, and you’ll normally need several iterations to get things just right. One advantage of the Snakemake approach compared to regular scripts is that Snakemake fails fast when there is a problem, rather than ploughing on and potentially running junk calculations on partial or corrupted data. Another advantage is that when a step fails we can safely resume from where we left off, as we’ll see in the next episode.

Finally, edit the names in the Snakefile back to the correct version and re-run to confirm that all is well.

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/spectrum.pdf

Content from How Snakemake plans jobs


Last updated on 2025-07-28

Estimated time: 15 minutes

Overview

Questions

  • How do I visualise a Snakemake workflow?
  • How does Snakemake avoid unnecessary work?
  • How do I control what steps will be run?

Objectives

  • View the DAG for our pipeline
  • Understand the logic Snakemake uses when running and re-running jobs

The DAG


You may have noticed that one of the messages Snakemake always prints is:

OUTPUT

Building DAG of jobs...

A DAG is a Directed Acyclic Graph and it can be pictured like so:

TODO
TODO

The above DAG is based on three of our existing rules, and shows all the jobs Snakemake would run to compute the pseudoscalar decay constant of the \(\beta = 4.0\) ensemble.

Note that:

  • A rule can appear more than once, with different wildcards (a rule plus wildcard values defines a job)
  • A rule may not be used at all, if it is not required for the target outputs
  • The arrows show dependency ordering between jobs
  • Snakemake can run the jobs in any order that doesn’t break dependency. For example, one_loop_matching cannot run until both ps_mass and avg_plaquette have completed, but it may run before or after count_trajectories
  • This is a work list, not a flowchart, so there are no if/else decisions or loops. Snakemake runs every job in the DAG exactly once
  • The DAG depends both on the Snakefile and on the requested target outputs, and the files already present
  • When building the DAG, Snakemake does not look at the shell part of the rules at all. Only when running the DAG will Snakemake check that the shell commands are working and producing the expected output files

How many jobs?

If we asked Snakemake to run one_loop_matching on all twelve ensembles (beta2.0 to beta10.0), how many jobs would that be in total?

36 in total:

  • 12 \(\times\) one_loop_matching
  • 12 \(\times\) ps_mass
  • 12 \(\times\) avg_plaquette
  • 0 \(\times\) count_trajectories

Snakemake is lazy, and laziness is good


For the last few episodes, we’ve told you to run Snakemake like this:

snakemake --jobs 1 --forceall --printshellcmds --use-conda 

As a reminder, the --jobs 1 flag tells Snakemake to run one job at a time, --printshellcmds is to print out the shell commands before running them, and --use-conda to ensure that Snakemake sets up the correct Conda environment.

The --forceall flag tells Snakemake to run every job in the DAG, whether or not the corresponding outputs are already up to date; in normal usage you don’t want this.

At the end of the last chapter, we generated a spectrum plot by running:

snakemake --jobs 1 --forceall --printshellcmds --use-conda assets/plots/spectrum.pdf

Now try without the --forceall option. Assuming that the output files are already created, you’ll see this:

$ snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf
TODO

In normal operation, Snakemake only runs a job if:

  1. A target file you explicitly requested to make is missing,
  2. An intermediate file is missing and it is needed in the process of making a target file,
  3. Snakemake can see an input file which is newer than an output file, or
  4. A rule definition or configuration has changed since the output file was created.

The last of these relies on a ledger that Snakemake saves into the .snakemake directory.

Let’s demonstrate each of these in turn, by altering some files and re-running Snakemake without the --forceall option.

BASH

$ rm assets/plots/spectrum.pdf
$ snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf

This just re-runs spectrum, the final step.

BASH

$ rm intermediary_data/beta*/corr.ps_mass.json.gz
$ snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf

“Nothing to be done”. Some intermediate output is missing, but Snakemake already has the file you are telling it to make, so it doesn’t worry.

BASH

touch raw_data/beta*/out_pg
snakemake --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf

The touch command is a standard Unix command that resets the timestamp of the file, so now the configuration generation logs look to Snakemake as if they were just modified.

Snakemake sees that some of the input files used in the process of producing assets/plots/spectrum.pdf are newer than the existing output file, so it needs to run the avg_plaquette and one_loop_matching steps again. Of course, the one_loop_matching step needs the pseudoscalar mass data that we deleted earlier, so now the correlation function fitting step is re-run also.

Explicitly telling Snakemake what to re-run


The default timestamp-based logic is really useful when you want to:

  1. Change or add some inputs to an existing analysis without re-processing everything
  2. Continue running a workflow that failed part-way

In most cases you can also rely on Snakemake to detect when you have edited a rule, but sometimes you need to be explicit, for example if you have updated an external script or changed a setting that Snakemake doesn’t see.

The --forcerun flag allows you to explicitly tell Snakemake that a rule has changed and that all outputs from that rule need to be re-evaluated.

snakemake --forcerun spectrum --jobs 1 --printshellcmds --use-conda assets/plots/spectrum.pdf

Note on --forcerun

Due to a quirk of the way Snakemake parses command-line options, you need to make sure there are options after the --forcerun ..., before the list of target outputs. If you don’t do this, Snakemake will think that the target files are instead items to add to the --forcerun list, and then when building the DAG it will just try to run the default rule.

The easiest way is to put the --jobs flag before the target outputs. Then you can list multiple rules to re-run, and also multiple targets, and Snakemake can tell which is which.

BASH

snakemake --forcerun avg_plaquette ps_mass --jobs 1 --printshellcmds --use-conda intermediary_data/beta2.0/pg.corr.ps_decay_const.json.gz intermediary_data/beta2.5/pg.corr.ps_decay_const.json.gz

The reason for using the --jobs flag specifically is that you pretty much always want this option.

The --force flag specifies that the target outputs named on the command line should always be regenerated, so you can use this to explicitly re-make specific files.

BASH

$ snakemake --jobs 1 --force --printshellcmds --use-conda assets/plots/spectrum.pdf

This always re-runs spectrum, regardless of whether the output file is there already. For all intermediate outputs, Snakemake applies the default timestamp-based logic. Contrast with --forceall, which runs the entire DAG every time.

Visualising the DAG


Snakemake can draw a picture of the DAG for you, if you run it like this:

$ snakemake --force --dag assets/plots/spectrum.pdf | gm display -

Using the --dag option implicitly activates the --dry-run option, so Snakemake will not actually run any jobs; it just prints the DAG and stops. The DAG is printed in a text format (Graphviz DOT), so we use the gm command to turn it into a picture and show it on the screen.

Note on gm display

The gm command is provided by the GraphicsMagick toolkit. On systems where gm will not display an image directly, you can instead save it to a PNG file. You will need the dot program from the GraphViz package installed.

$ snakemake --force --dag assets/plots/spectrum.pdf | dot -Tpng > dag.png

![TODO][fig-dag2]

The boxes drawn with dotted lines indicate steps that are not to be run, as the output files are already present and newer than the input files.
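For larger workflows the per-job DAG can become unreadable. A more compact alternative is the --rulegraph option, which draws one box per rule rather than one per job; a minimal sketch, using the same target as above:

BASH

$ snakemake --force --rulegraph assets/plots/spectrum.pdf | dot -Tpng > rulegraph.png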

Visualising the effect of the --forcerun and --force flags

Run one_loop_matching on the beta2.0 ensemble, and then use the --dag option as shown above to check:

  1. How many jobs will run if you ask again to create this output with no --force, --forcerun or --forceall options?

  2. How many if you use the --force option?

  3. How many if you use the --forcerun ps_mass option?

  4. How many if you edit the metadata file so that the ps_plateau_start for the \(\beta=2.0\) ensemble is TODO, rather than TODO?

This is a way to make the result in the first place:

BASH

$ snakemake --jobs 1 --printshellcmds --use-conda intermediary_data/beta2.0/pg.corr.ps_decay_const.json

The DAG for this target can then be inspected with:

BASH

$ snakemake --dag intermediary_data/beta2.0/pg.corr.ps_decay_const.json | gm display -

  1. This command should show three boxes, but all are dotted, so no jobs are actually to be run.

  2. The --force flag re-runs only the job that creates the requested output file, so in this case one box is solid, and only that job will run.

  3. With --forcerun ps_mass, the ps_mass job will re-run, and Snakemake sees that this also requires re-running one_loop_matching, so the answer is 2.

    If you see a message like the one below, it’s because you need to put an option after ps_mass or else Snakemake gets confused about what are parameters of --forcerun, and what things are targets.

    ERROR

    WorkflowError:
    Target rules may not contain wildcards.
  4. Editing the metadata file changes the parameters seen by the ps_mass rule, which has the same effect as forcing that rule to re-run, so again there will be two jobs to be run from the DAG.

With older versions of Snakemake this change would not be auto-detected, and you can still see that behaviour if you remove the hidden .snakemake directory: Snakemake then has no memory of the rule change, so it will not re-run any jobs unless explicitly told to.

Removing files to trigger reprocessing

In general, getting Snakemake to re-run things by removing files is a bad idea, because it’s easy to forget about intermediate files that actually contain stale results and need to be updated. Using the --forceall flag is simpler and more reliable. If in doubt, and if it will not be too time consuming, keep it simple and just use --forceall to run the whole workflow from scratch.

For the opposite case where you want to avoid re-running particular steps, see the --touch option of Snakemake mentioned later in the lesson.
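As a preview, a minimal sketch of that usage looks like this; it updates the timestamps of existing output files on the way to the named target so that they are considered up to date, without running any jobs:

BASH

$ snakemake --jobs 1 --touch assets/plots/spectrum.pdf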

[fig-dag2]: fig/dag_2.png {alt='TODO: DAG for the partial workflow, with dotted boxes for the jobs that are already up to date and a solid box for the job that will run.'}

Key Points

  • A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
  • Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
  • If output files already exist, Snakemake can skip parts of the DAG
  • Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating

Content from Optimising workflow performance


Last updated on 2025-05-20 | Edit this page

Estimated time: 40 minutes

Overview

Questions

  • What compute resources are available on my system?
  • How do I define jobs with more than one thread?
  • How do I measure the compute resources being used by a workflow?
  • How do I run my workflow steps in parallel?

Objectives

  • Understand CPU, RAM and I/O bottlenecks
  • Understand the threads declaration
  • Use common Unix tools to look at resource usage

Processes, threads and processors


Some definitions:

  • Process: A running program (in our case, each Snakemake job can be considered one process)
  • Threads: Each process has one or more threads which run in parallel
  • Processor: Your computer has multiple CPU cores or processors, each of which can run one thread at a time

These definitions are a little simplified, but fine for our needs. The operating system kernel shares out threads among processors:

  • Having fewer threads than processors means you are not fully using all your CPU cores
  • Having more threads than processors means threads have to “timeslice” on a core which is generally suboptimal

If you tell Snakemake how many threads each rule will use, and how many cores you have available, it will start jobs in parallel to use all your cores. In the diagram below, five jobs are ready to run and there are four system cores.
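For example, to let Snakemake schedule jobs across up to four cores at once, a sketch using the same target as before would be:

BASH

$ snakemake --jobs 4 --use-conda assets/plots/spectrum.pdf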

Listing the resources of your machine


So, to know how many threads to make available to Snakemake, we need to know how many CPU cores our machine has. On Linux, we can find this out with the lscpu command.

$ lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                ARM
  Model name:             Cortex-A72
    Model:                3
    Thread(s) per core:   1

There we can see that we have four CPU cores, each of which can run a single thread.
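A quicker alternative on Linux is the nproc command, which simply prints the number of processing units available; on the machine above it gives:

BASH

$ nproc
4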

On macOS meanwhile, we use the command sysctl hw.ncpu:

$ sysctl hw.ncpu
hw.ncpu: 8

In this case, we see that this Mac has eight cores.

Likewise, on Linux we can find out the amount of RAM available with the free command:

BASH

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       1.1Gi       110Mi        97Mi       2.6Gi       2.6Gi
Swap:          199Mi       199Mi        60Ki

In this case, the machine has 3.7GiB of total RAM.

On macOS, the command is sysctl hw.memsize:

$ sysctl hw.memsize
hw.memsize: 34359738368

This machine has 34,359,738,368 bytes of RAM in total. Dividing by the number of bytes in 1GiB (\(1024^3\) bytes), that becomes 32GiB RAM.
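You can do this conversion directly in the shell, for example with Bash integer arithmetic:

BASH

$ echo $(( 34359738368 / 1024**3 ))
32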

We don’t want to use all of this RAM, but if we don’t mind other applications being unresponsive while our workflow runs, we can use the majority of it.

Finally, to check the available disk space, on the current partition:

BASH

$ df -h .

(or df -h without the . to show all partitions) This is the same on both macOS and Linux.

Parallel jobs in Snakemake


You may want to see the relevant part of the Snakemake documentation.

We’ll force all the intermediary steps to re-run by using the --forceall flag to Snakemake and time the whole run using the time command.

BASH

$ time snakemake --jobs 1 --forceall --use-conda assets/plots/spectrum.pdf
...
TODO TIME

Measuring how concurrency affects execution time

What is the wallclock time reported by the above command? We’ll work out the average for everyone present, or if you are working through the material on your own, repeat the measurement three times to get your own average.

Now change the Snakemake concurrency option to --jobs 2 and then --jobs 4.

  • How does the total execution time change?
  • What factors do you think limit the power of this setting to reduce the execution time?

The time will vary depending on the system configuration, but somewhere around TODO seconds is expected. This should reduce to around TODO seconds with --jobs 2, but higher --jobs values will produce diminishing returns.

Things that may limit the effectiveness of parallel execution include:

  • The number of processors in the machine
  • The number of jobs in the DAG which are independent and can therefore be run in parallel
  • The existence of single long-running jobs
  • The amount of RAM in the machine
  • The speed at which data can be read from and written to disk

There are a few gotchas to bear in mind when using parallel execution:

  1. Parallel jobs will use more RAM. If you run out then either your OS will swap data to disk, or a process will crash.
  2. Parallel jobs may trip over each other if they try to write to the same filename at the same time (this can happen with temporary files).
  3. The on-screen output from parallel jobs will be jumbled, so save any output to log files instead.

Multi-thread rules in Snakemake


In the diagram at the top, we showed jobs with 2 and 8 threads. These are defined by adding a threads: block to the rule definition. We could do this for the ps_mass rule:

rule ps_mass:
    input:
        data="raw_data/beta{beta}/out_corr",
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot="intermediary_data/beta{beta}/corr.ps_eff_mass.{plot_filetype}",
    log:
        messages="intermediary_data/beta{beta}/corr.ps_mass.log",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    threads: 4
    shell:
        "python -m su2pg_analysis.meson_mass {input.data} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {plot_styles} |& tee {log.messages}"

You should explicitly use threads: 4 rather than something like params: threads="4", because Snakemake considers the threads value when scheduling jobs. Also, if the number of threads requested by a rule is more than the number of cores you make available, Snakemake will scale the job down to use the lower number.

Note that the threads value is used for scheduling; Snakemake does not automatically limit how many threads your program actually uses. To pass the number on to your program, use the {threads} placeholder in the shell command, for example via a command-line option or by setting an environment variable such as OMP_NUM_THREADS.
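A minimal sketch of both approaches (the rule name, file names and program name here are placeholders, not part of our workflow):

rule threaded_example:
    # hypothetical rule: input, output and program name are placeholders
    input: "some_input.dat"
    output: "some_output.dat"
    threads: 4
    # pass the thread count to the program explicitly, and also via OMP_NUM_THREADS
    shell:
        "OMP_NUM_THREADS={threads} some_program --threads {threads} {input} > {output}"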

Fine-grained profiling

Rather than timing the entire workflow, we can ask Snakemake to benchmark an individual rule.

For example, to benchmark the ps_mass step we could add this to the rule definition:

rule ps_mass:
    benchmark:
        "benchmarks/ps_mass.beta{beta}.txt"
    ...

The dataset here is so small that the numbers are tiny, but for real data this can be very useful as it shows time, memory usage and IO load for all jobs.
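The benchmark file is a small tab-separated table that you can inspect directly once the job has run; the exact columns depend on your Snakemake version, but typically include the wallclock time, maximum memory usage and I/O counts. For the \(\beta=2.0\) ensemble, for example:

BASH

$ column -t benchmarks/ps_mass.beta2.0.txt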

Running jobs on a cluster


Learning about clusters is beyond the scope of this course, but can be essential for more complex workflows working with large amounts of data.

When working with Snakemake, there are two options to getting the workflow running on a cluster:

  1. Similarly to most tools, we may install Snakemake on the cluster, write a job script, and execute Snakemake on our workflow inside a job.

  2. We can teach Snakemake how to run jobs on the cluster, and run our workflow from our own computer, having Snakemake do the work of submitting and monitoring the jobs for us.

To run Snakemake in the second way, someone will need to determine the right parameters for your particular cluster and save them as a profile. Once this is working, you can share the profile with other users on the cluster, so discuss this with your cluster sysadmin.

Instructions for configuring the Slurm executor plugin can be found in the Snakemake plugin catalog, along with the drmaa, cluster-generic and cluster-sync plugins which can support PBS, SGE and other cluster schedulers.
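As a sketch of the second approach on a Slurm cluster, assuming the snakemake-executor-plugin-slurm is installed and using placeholder account and partition names, the invocation might look something like:

BASH

$ snakemake --executor slurm --jobs 10 --use-conda \
    --default-resources slurm_account=myproject slurm_partition=compute \
    assets/plots/spectrum.pdf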

![][fig-cluster]

Running workflows on HPC or Cloud systems could be a whole course in itself. The topic is too important not to be mentioned here, but also complex to teach because you need a cluster to work on.

If you are teaching this lesson and have institutional HPC then ideally you should liaise with the administrators of the system to make a suitable installation of a recent Snakemake version and a profile to run jobs on the cluster job scheduler. In practice this may be easier said than done!

If you are able to demonstrate Snakemake running on cloud as part of a workshop then we’d much appreciate any feedback on how you did this and how it went.

Cluster demo

At this point in the course there may be a cluster demo…

[fig-cluster]: fig/cluster.jpg {alt='A photo of high-performance computing hardware racked in five cabinets in a server room, with the cabinet doors open to show the systems and cabling inside.'}

Key Points

  • To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
  • You also need to consider RAM, disk, and network bottlenecks
  • Profile your jobs to see what is taking most resources
  • Snakemake is great for running workflows on compute clusters