Chaining rules

Last updated on 2025-10-09 | Edit this page

Overview

Questions

How do I combine rules into a workflow?
How can I make a rule with multiple input files?
How should I refer to multiple files with similar names?

Objectives

Use Snakemake to compute and then plot the average plaquettes of multiple ensembles
Understand how rules are linked by filename patterns
Be able to use multiple input files in one rule

A pipeline of multiple rules

We have so far been able to count the number of generated trajectories, and compute the average plaquette, given an output log from the configuration generation. However, an individual average plaquette is not interesting in isolation; what is more interesting is how it varies between different values of the input parameters. To do this, we will need to take the output of the avg_plaquette rule that we defined earlier, and use it as input for another rule.

Let’s define that rule now:

# Take individual data files for average plaquette and plot combined results
rule plot_avg_plaquette:
    input:
        "intermediary_data/beta1.8/pg.plaquette.json.gz",
        "intermediary_data/beta2.0/pg.plaquette.json.gz",
        "intermediary_data/beta2.2/pg.plaquette.json.gz",
    output:
        "assets/plots/plaquette_scan.pdf"
    conda: "envs/analysis.yml"
    shell:
        "python src/plot_plaquette.py {input} --output_filename {output}"

Callout

You can see that here we’re putting “files that want to be included in a paper” in an assets directory, similarly to the raw_data and intermediary_data directories we discussed in a previous episode. It can be useful to further distinguish plots, tables, and other definitions, by using subdirectories in this directory.

Rather than one input, as we have seen in rules so far, this rule requires three. When Snakemake substitutes these into the {input} placeholder, it will automatically add a space between them. Let’s test this now:

snakemake --cores 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf

Look at the logging messages that Snakemake prints in the terminal. What has happened here?

Snakemake looks for a rule to make assets/plots/plaquette_scan.pdf
It determines that the plot_avg_plaquette rule can do this, if it has intermediary_data/beta1.8/pg.plaquette.json.gz, intermediary_data/beta2.0/pg.plaquette.json.gz, and intermediary_data/beta2.2/pg.plaquette.json.gz.
Snakemake looks for a rule to make intermediary_data/beta1.8/pg.plaquette.json.gz
It determines that avg_plaquette can make this if subdir=beta1.8
It sees that the input needed is therefore raw_data/beta1.8/out_pg
Now Snakemake has reached an available input file, it runs the avg_plaquette rule.
It then looks through the other two \(\beta\) values in turn, repeating the process until it has all of the needed inputs.
Finally, it runs the plot_avg_plaquette rule.

Here’s a visual representation of this process:

A visual representation of the above process showing the rule definitions, with arrows added to indicate the order wildcards and placeholders are substituted. Blue arrows start from the input of the `plot_avg_plaquette` rule, which are the files `intermediary_data/beta1.8/pg.plaquette.json.gz`, `intermediary_data/beta2.0/pg.plaquette.json.gz`, and `intermediary_data/beta2.2/pg.plaquette.json.gz`, then point down from components of the filename to wildcards in the output of the `avg_plaquette` rule. Orange arrows then track back up through the shell parts of both rules, where the placeholders are, and finally back to the target output filename at the top.

This, in a nutshell, is how we build workflows in Snakemake.

Define rules for all the processing steps
Choose input and output naming patterns that allow Snakemake to link the rules
Tell Snakemake to generate the final output files

If you are used to writing regular scripts this takes a little getting used to. Rather than listing steps in order of execution, you are always working backwards from the final desired result. The order of operations is determined by applying the pattern matching rules to the filenames, not by the order of the rules in the Snakefile.

Callout

Choosing file name patterns

Chaining rules in Snakemake is a matter of choosing filename patterns that connect the rules. There’s something of an art to it, and most times there are several options that will work, but in all cases the file names you choose will need to be consistent and unabiguous.

Making file lists easier

In the rule above, we plotted the average plaquette for three values of \(\beta\) by listing the files expected to contain their values. In fact, we have data for a larger number of \(\beta\) values, but typing out each file by hand would be quite cumbersome. We can make use of the expand() function to do this more neatly:

    input:
        expand(
            "intermediary_data/beta{beta}/pg.plaquette.json.gz",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        ),

The first argument to expand() here is a template for the filename, and subsequent keyword arguments are lists of variables to fill into the placeholders. The output is the cartesian product of all the parameter lists.

We can check that this works correctly:

snakemake --cores 1 --forceall --printshellcmds --use-conda assets/plots/plaquette_scan.pdf

Challenge

Tabulating trajectory counts

The script src/tabulate_counts.py will take a list of files containing trajectory counts, and output them in a LaTeX table. Write a rule to generate this table for all values of \(\beta\), and output it to assets/tables/counts.tex.

Solution 1

The rule should look like:

# Output a LaTeX table of trajectory counts
rule tabulate_counts:
    input:
        expand(
            "intermediary_data/beta{beta}/pg.count",
            beta=[1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
        )
    output: "assets/tables/counts.tex"
    conda: "envs/analysis.yml"
    shell:
        "python src/tabulate_counts.py {input} > {output}"

To test this, for example:

snakemake --cores 1 --forceall --printshellcmds --use-conda assets/tables/counts.tex

Challenge

Tabulating trajectory counts (continued)

This setup currently requires reading the value of \(\beta\) from the filename. Why is this not ideal? How would the workflow need to be changed to avoid this?

Solution 2

It’s easy for files to be misnamed when creating or copying them. Putting the wrong data into the file is harder, especially when it’s a raw data file generated by the same program as the rest of the data. (If the wrong value were given as input, this could happen, but the corresponding output data would also be generated at that incorrect value. Provided the values are treated consistently, the downstream analysis could in fact still be valid, just not exactly as intended.)

Currently, grep -c is used to count the number of trajectories. This would need to be replaced or supplemented with a tool that read out the value of \(\beta\) from the input log, and outputs it along with the trajectory count. The src/tabulate_counts.py script could then be updated to use this number, rather than the filename.

In fact, the plaquette module does just this; in addition to the average plaquette, it also records the number of trajectories generated as part of the metadata and provenance information it tracks.

Key Points

Snakemake links up rules by iteratively looking for rules that make missing inputs
Careful choice of filenames allows this to work
Rules may have multiple named input files (and output files)
Use expand() to generate lists of filenames from a template