Running commands with Snakemake
Last updated on 2025-07-28
Overview
Questions
- How do I run a simple command with Snakemake?
Objectives
- Create a Snakemake recipe (a Snakefile)
- Use Snakemake to compute the average plaquette of an ensemble
Introduction
Data analysis in lattice quantum field theory generally has many moving parts: you will likely have many ensembles, with differing physical and algorithmic parameters, and for each many different observables may be computed. These need to be combined in different ways, making sure that compatible statistics are used. Making sure that each step runs in the correct order is non-trivial, requiring careful bookkeeping, especially if we want to update data as ensembles are extended, if we want to take advantage of parallelism to get results faster, and if we want auditability to be able to verify later what steps were performed.
While we could build up tools to do all of these things from scratch, these are not challenges exclusive to lattice, and so we can take advantage of others’ work rather than reinventing the wheel. This frees up our time to focus on the physics challenges. The category of “tools to help run complex arrangements of tools in the right order” is called “workflow management”; many workflow managers are available, most of which are specialised to a specific class of applications.
One workflow manager developed for scientific data analysis is called Snakemake; this will be the target of this lesson. Snakemake is similar to GNU Make, in that you create a text file containing rules specifying how input files are translated to output files, and then the software will work out what rules to run to generate a specified output from the available input files. Unlike Make, Snakemake uses a syntax closely based on Python, and files containing rules can be extended using standard Python syntax. It also has many quality-of-life improvements compared to Make, and so is much better suited for writing data analysis workflows.
At this point, you should have Snakemake already installed and available to you. To test this, we can open a terminal and run
$ snakemake --version
8.25.3
If you instead get a “command not found” error, go back to the setup and check that you have completed all the necessary steps.
Looking at the sample data
You should already have the sample data files unpacked. (If not, refer back to the lesson setup.) Under the su2pg/raw_data directory, you will find a series of subdirectories, each containing data for a single ensemble. In each are files containing the log of the configuration generation, the computation of the quenched meson spectrum, and the computation of the Wilson flow.
The sample data are for the SU(2) pure Yang-Mills theory, and have been generated using the HiRep code.
Each log contains header lines describing the setup, information on the computation being performed, and results for observables computed on each configuration. Code to parse these logs and compute statistics is included with the sample data; we’ll use this in due course.
Making a Snakefile
To start with, let’s define a rule to count the number of lines in one of the raw data files.
Within the su2pg/workflow directory, edit a new text file named Snakefile. Into it, insert the following content:
rule count_lines:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "wc -l raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"
Key points about this file
- The file is named Snakefile, with a capital S and no file extension.
- Some lines are indented. Indents must be with space characters, not tabs.
- The rule definition starts with the keyword rule followed by the rule name, then a colon.
- We named the rule count_lines. You may use letters, numbers or underscores, but the rule name must begin with a letter and may not be a keyword.
- The keywords input, output, and shell are all followed by a colon.
- The file names and the shell command are all in "quotes".
The first line tells Snakemake we are defining a new rule. Subsequent indented lines form a part of this rule; while there are none here, any subsequent unindented lines would not be included in the rule. The input: line tells Snakemake what files to look for to be able to run this rule. If this file is missing (and there is no rule to create it), Snakemake will not consider running this rule. The output: line tells Snakemake what files to expect the rule to generate. If this file is not generated, then Snakemake will abort the workflow with an error. Finally, the shell: block tells Snakemake what shell commands to run to get the specified output from the given input.
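To make the roles of input:, output:, and shell: concrete, here is a rough, hand-run shell sketch of the checks a rule like this implies. It is not how Snakemake is implemented. The temporary directory and the three-line stand-in input file are assumptions, made only so that the example is self-contained and leaves the sample data untouched.

```shell
# Hand-run sketch of the bookkeeping behind the count_lines rule.
# Works in a throwaway directory with a stand-in input file (both assumptions).
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p raw_data/beta2.0
printf 'line one\nline two\nline three\n' > raw_data/beta2.0/out_pg

input=raw_data/beta2.0/out_pg
output=intermediary_data/beta2.0/pg.count

# input: the rule can only run if the input file exists
if [ ! -f "$input" ]; then
    echo "Missing input file: $input" >&2
    exit 1
fi

# Make sure the directory for the output exists before running the command
mkdir -p "$(dirname "$output")"

# shell: the command that turns the input into the output
wc -l "$input" > "$output"

# output: it is an error if the expected file did not appear
if [ ! -f "$output" ]; then
    echo "Expected output was not generated: $output" >&2
    exit 1
fi

# Show the line count followed by the input file name
cat "$output"
```

Snakemake takes care of these checks for us, which is why the rule itself only needs to name the files and the command.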
Going back to the shell now, we can test this rule. From the su2pg directory, we can run the command
snakemake --jobs 1 --forceall --printshellcmds intermediary_data/beta2.0/pg.count
If we’ve made any transcription errors in the rule (missing quotes, bad indentations, etc.), then it will become clear at this point, as we’ll receive an error that we will need to fix.
For now, we will consistently run snakemake with the --jobs 1 --forceall --printshellcmds options. As we move through the lesson, we’ll explain in more detail when we need to modify them.
Let’s check that the output was correctly generated:
$ cat intermediary_data/beta2.0/pg.count
31064 raw_data/beta2.0/out_pg
Callout
You might have noticed that we are grouping files into directories like raw_data and intermediary_data. It is generally a good idea to keep raw input data separate from data generated by the analysis. This means that if you need to run a clean analysis starting from your input data, then it is much easier to know what to remove and what to keep. Ideally, the raw_data directory should be kept read-only, so that you don’t accidentally modify your input data. Similarly, it is a good idea to separate out “files that you want to include in a paper” from “intermediary files generated by the workflow but not needed in the paper”; we’ll talk more about that in a later section.
You might also worry that your tooling will need to use mkdir to create these directories; in fact, Snakemake will automatically create all directories where it expects to see output from rules that it runs.
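One possible way to keep a directory read-only is to remove write permission with chmod. The sketch below does this on a throwaway directory created with mktemp, which is an assumption made so the example is safe to run. For the lesson data, the target would be the raw_data directory instead.

```shell
# Sketch: protect raw data from accidental modification.
# Demonstrated on a throwaway directory (an assumption for this example).
demo=$(mktemp -d)
mkdir -p "$demo/raw_data/beta2.0"
echo "stand-in log line" > "$demo/raw_data/beta2.0/out_pg"

# Remove write permission for all users, recursively
chmod -R a-w "$demo/raw_data"

# Record and show the file's permission string; it should contain no "w"
perms=$(ls -l "$demo/raw_data/beta2.0/out_pg" | cut -c1-10)
echo "$perms"

# Restore write permission for the owner so the throwaway directory can be removed
chmod -R u+w "$demo/raw_data"
rm -r "$demo"
```

The files can still be read by the workflow; only writes are refused (for non-administrator users).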
Running Snakemake
Run snakemake --help | less to see the help for all available options. What does the --printshellcmds option in the snakemake command above do?
1. Protects existing output files
2. Prints the shell commands that are being run to the terminal
3. Tells Snakemake to only run one process at a time
4. Prompts the user for the correct input file
Hint: you can search in the text by pressing /, and quit back to the shell with q
2. Prints the shell commands that are being run to the terminal
This is such a useful thing we don’t know why it isn’t the default!
The --jobs 1 option is what tells Snakemake to only run one process at a time, and we’ll stick with this for now as it makes things simpler. The --forceall option tells Snakemake to always recreate output files, and we’ll learn about protected outputs much later in the course. Answer 4 is a total red herring, as Snakemake never prompts interactively for user input.
Counting trajectories
The count of output lines isn’t particularly useful. Potentially more interesting is the number of trajectories in a given file. In a HiRep generation log, each trajectory concludes with a line of the form
OUTPUT
[MAIN][0]Trajectory #1: generated in [39.717707 sec]
We can use grep to count these lines: with the -c option, grep prints the number of matching lines instead of the lines themselves.
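A minimal sketch of this kind of counting, using grep with the -c option on a small made-up log excerpt. Only the "generated in" line format is taken from the example above; the placeholder lines and the second timing are invented for illustration and are not real HiRep output.

```shell
# Sketch: count trajectory-completion lines in a made-up log excerpt.
# The placeholder lines and the second timing are invented, not real HiRep output.
sample=$(mktemp)
cat > "$sample" << 'EOF'
placeholder header line
[MAIN][0]Trajectory #1: generated in [39.717707 sec]
placeholder line between trajectories
[MAIN][0]Trajectory #2: generated in [40.000000 sec]
EOF

# -c prints the number of matching lines instead of the lines themselves
count=$(grep -c generated "$sample")
echo "$count"   # prints 2

rm "$sample"
```

Matching on the word "generated" works here because it only appears on the lines that conclude a trajectory; a more specific pattern would be safer if other log lines could contain the same word.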
Counting trajectories in Snakemake
Modify the Snakefile to count the number of trajectories in raw_data/beta2.0/out_pg, rather than the number of lines.
- Rename the rule to count_trajectories
- Keep the output file name the same
- Remember that the result needs to go into the output file, not just be printed on the screen
- Test the new rule once it is done.
rule count_trajectories:
    input: "raw_data/beta2.0/out_pg"
    output: "intermediary_data/beta2.0/pg.count"
    shell:
        "grep -c generated raw_data/beta2.0/out_pg > intermediary_data/beta2.0/pg.count"
Key Points
- Before running Snakemake you need to write a Snakefile
- A Snakefile is a text file which defines a list of rules
- Rules have inputs, outputs, and shell commands to be run
- You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule