Tidying up
Last updated on 2025-10-29 | Edit this page
Estimated time: 35 minutes
Overview
Questions
- How do I split a Snakefile into manageable pieces?
- How do I avoid needing to list every file to generate in my
snakemake call?
- What should I bear in mind when using Git for my Snakemake workflow?
- What should I include in the README for my workflow?
Objectives
- Be able to avoid having long monolithic Snakefiles
- Be able to have Snakemake generate a set of targets without explicitly specifying them
- Understand how to effectively use Git to version control Snakemake workflows
- Be able to write an effective README on how to use a workflow
Breaking up the Snakefile
So far, we have written all of our rules into a single long file
called Snakefile. As we continue to add more rules, this
can start to get unwieldy, and we might want to break it up into smaller
pieces.
Let’s do this now. We can take the rules relating only to the
pg output files, and place them into a new file
workflow/rules/pg.smk. Since conda: directives
are defined relative to the current file, we need to replace
envs/analysis.yml with ../envs/analysis.yml
when we do this.
In place of the rules we just moved, in the Snakefile,
we add the line
include: "rules/pg.smk"

This tells Snakemake to take the contents of the new file we created,
and place it at that point in the file. Unlike Python, where the
import statement creates a new scope, in
Snakemake, anything defined above the include line is
available to the included code. So we are safe to use the configuration
parameters and metadata that we load at the top of the file.
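After the move, workflow/rules/pg.smk might look something like the sketch below; the input, output, and shell lines are illustrative placeholders rather than the lesson’s actual rule, and the point is the adjusted conda: path.

```snakemake
# workflow/rules/pg.smk -- sketch; paths and the shell command are placeholders
rule avg_plaquette:
    input:
        "raw_data/{ensemble}/out_pg"
    output:
        "intermediary_data/{ensemble}/avg_plaquette.json"
    conda:
        # one directory level up from rules/, hence the ../
        "../envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette {input} > {output}"
```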
Clean out the Snakefile
Sort the remaining rules in the Snakefile into
additional .smk files. Place these in the
rules subdirectory with pg.smk.
One possible breakdown to use would be
- Rules relating to correlation function fits
- Rules relating to the Wilson flow
- Rules combining output of more than one ensemble
Hint: conda: paths are defined relative to the
source file, so they will need to be adjusted with a
../ prefix.
One option is to have four files in workflow/rules:
- pg.smk, containing the rules count_trajectories and avg_plaquette
- corr_fits.smk, containing the rules meson_mass and one_loop_matching
- wflow.smk, containing the rule w0
- output.smk, containing the rules plot_avg_plaquette, tabulate_counts, restricted_spectrum, spectrum, and quick_spectrum.
In each one, any conda: directives should point to
../envs/analysis.yml.
Tidy up the plot_styles
You might have noticed that we’re using the plot_styles
configuration option as an argument to the plotting rules, but without
including it in the input blocks. Since the argument is a
filename, it is a good idea to let Snakemake know about this, so that if
we adjust the style file, Snakemake knows to re-run the plotting
rules.
Make that change now for all rules depending on
plot_styles.
The input: block should contain a new line similar
to:
        plot_styles=config["plot_styles"],

The shell: block should then replace references to
{config[plot_styles]} with
{input.plot_styles}.
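For example, a plotting rule might end up looking like this (the data input and script name here are invented for illustration; only the plot_styles lines reflect the change):

```snakemake
rule plot_avg_plaquette:
    input:
        # hypothetical data dependency; the new line is plot_styles
        data="intermediary_data/avg_plaquette.csv",
        plot_styles=config["plot_styles"],
    output:
        f"assets/plots/plaquette_scan{config['plot_filetype']}"
    shell:
        "python plot.py {input.data} --styles {input.plot_styles} --output {output}"
```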
Use the metadata
Currently, many of our rules explicitly list the same values of
beta. Can you tidy this up so that they use the values from
the metadata file instead?
Each instance of beta= inside an expand may be replaced
with:

beta=sorted(set(metadata["beta"]))

Alternatively, you may define this as a variable at the top of the file, and use it by name, rather than repeating it every time.
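For instance, near the top of the Snakefile (the variable name BETAS and the path in the comment are our own choices):

```snakemake
# After the metadata file has been loaded:
BETAS = sorted(set(metadata["beta"]))

# Then, in any rule that previously listed the values explicitly:
#     expand("intermediary_data/pg/b{beta}/avg_plaquette.json", beta=BETAS)
```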
For the spectrum_scaled rule, you will additionally need
to filter the metadata, for example via
beta=sorted(set(metadata[metadata["beta"] < 1.9]["beta"]))

A default target
When we come to run the full workflow to generate all of the assets for our publication, it is frustrating to need to list every single file we would like. It is much better if we can do this as part of the workflow, and ask Snakemake to generate a default set of targets.
We can do this by adding the following rule above all others in our
Snakefile:
tables = expand("assets/tables/{table}.tex", table=["counts"])
plots = expand(
    f"assets/plots/{{plot}}{config['plot_filetype']}",
    plot=["plaquette_scan", "spectrum"],
)
rule all:
    input:
        plots=plots,
        tables=tables,
    default_target: True

Unlike the rules we looked at in the previous section, this one should stay in the main Snakefile. Note that the rule only has inputs: Snakemake sees that those files must be present for the rule to complete, so runs the rules necessary to generate them. When we call Snakemake with no targets, as

snakemake --cores all --use-conda

then Snakemake will look to the all rule, and make
assets/tables/counts.tex,
assets/plots/plaquette_scan.pdf, and
assets/plots/spectrum.pdf, along with any intermediary
files they depend on.
It can also be a good idea to add a provenance stamp at this point, where you create an additional data file listing all outputs the workflow generated, along with hashes of their contents, to make it more obvious if any files left over from previous workflow runs sneak into the output.
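A minimal sketch of such a stamp, assuming we just want filename-to-digest pairs written as JSON (the helper names and file format here are our own invention, not a Snakemake feature):

```python
import hashlib
import json
from pathlib import Path


def provenance_stamp(paths):
    """Map each output file to the SHA-256 digest of its contents."""
    return {
        str(path): hashlib.sha256(Path(path).read_bytes()).hexdigest()
        for path in paths
    }


def write_stamp(paths, stamp_file="provenance.json"):
    """Write the stamp alongside the other workflow outputs."""
    Path(stamp_file).write_text(json.dumps(provenance_stamp(paths), indent=2))
```

This could be called from a run: block in the all rule, passing it the rule’s inputs; any leftover file whose recorded digest does not match the stamp was not produced by the current run.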
Using Snakemake with Git
Using a version control system such as Git is good practice when developing software of any kind, including analysis workflows. There are some steps that we can take to make our workflows fit more nicely into Git.
Firstly, we’d like to avoid committing files to the repository that
aren’t part of the workflow definition. We make use of the file
.gitignore in the repository root for this. In general, our
Git repository should only contain things that should not change from
run to run, such as the workflow definition and any workflow-specific
code. Some examples of things that probably shouldn’t be committed
include:
- Input data and metadata. These should live and be shared separately from the workflow, as they are inputs to the analysis rather than a dedicated part of the analysis workflow. This means that someone wanting to read the code doesn’t need to download gigabytes of unwanted data.
- Intermediary and output files. You may wish to share these with readers, and the output plots and tables should be included in your papers. But if a reader wants these from the workflow, they should download the input data and workflow, and run the latter. Mixing generated files with the repository will confuse matters, and make it so that any workflow re-runs create unwanted “uncommitted changes” that Git will notify you about.
- .snakemake directory. Similarly to the intermediary and output files, this will change from machine to machine and should not be committed. (In particular, it will get quite large with Conda environments, which are only useful on the computer they were created on.)
If you quote data from another paper, and have had to transcribe this into a machine-readable format like CSV yourself, then this may be included in the repository, since it is not a data asset that you can reasonably publish in a data repository under your own name. In this case, the data’s provenance and attribution should be clearly stated.
Let’s check the .gitignore for our workflow now.
$ cat .gitignore
# Common temporary files
thumbs.db
.DS_Store
# Python compiled objects
*.pyc
__pycache__/
.mypy_cache/
# Snakemake cache
.snakemake/
# Workflow outputs
assets/
data_assets/
intermediary_data/
# Workflow inputs that should not be held together with code
raw_data/*
metadata/*
# Common editor temporary files
*~
.#*#

You can see that all of the above are being ignored, as are some
common files to sneak into Git repositories—temporary files generated by
text editors and cache files from operating systems, for example. This
forms a good basic template for your own .gitignore files
for Snakemake workflows; you may want to expand yours based on GitHub’s templates, for
example.
Someone downloading your workflow will need to place the input data
and metadata in the correct locations. You might wish to prepare an
empty directory for the reader to place the necessary files into. Since
Git does not track empty directories, only files, we create an empty
file in it, and update the .gitignore rules to un-ignore
only that file.
touch metadata/.git_keep
nano .gitignore
# Don't ignore placeholder files for would-be empty directories
!*/.git_keep

Let’s commit these changes now:
git add .gitignore
git add metadata/.git_keep
git commit -m "Prepare empty directory for metadata"

Where you use libraries that you have written that are not on the
Python Package Index, it is a good idea to
incorporate these as Git submodules in your workflow. While
pip can install packages from GitHub repositories, this is
not robust against you moving or closing your GitHub account later, or
GitHub ceasing to offer free services. Instead, you can make available
a ZIP file containing the workflow and its primary dependencies, which
will continue to be installable significantly further into the
future.
Currently we have a directory libs/su2pg_analysis,
containing the library we have used for most of this analysis. This
library is hosted on GitHub at
https://github.com/edbennett/su2pg_analysis. To track this
as a Git submodule, we use the command
$ git submodule add https://github.com/edbennett/su2pg_analysis libs/su2pg_analysis
Adding existing repo at 'libs/su2pg_analysis' to the index
$ git commit -m "Add su2pg_analysis as submodule"
[main 89c3d3d] Add su2pg_analysis as submodule
 2 files changed, 4 insertions(+)
 create mode 160000 libs/su2pg_analysis

We also need to commit the work we’ve done on the workflow itself. (We’ve put this off while we’ve been learning, but in general it’s a good idea to commit regularly during development, not only when you’ve “finished”.)
git add workflow/ config/
git status

If we’re happy with what is to be committed, we can then run

git commit -m "first draft of complete workflow"

What to include?
Which of the following files should be included in a hypothetical workflow repository?
- sort.py, a tool for putting results in the correct order to show in a paper.
- out_corr, a correlation function output file.
- Snakefile, the workflow definition.
- .snakemake/, the Snakemake cache directory.
- spectrum.pdf, a plot created by the workflow.
- prd.mplstyle, a Matplotlib style file used by the workflow.
- README.md, guidelines on how to use the workflow.
- id_rsa, the SSH private key used to connect to clusters to run the workflow.
- Yes, this tool is clearly part of the workflow, so should be included in the repository, unless it’s being used from an external library.
- No, this is input data, so is not part of the software. This should be part of a data release, as otherwise a reader won’t be able to run the workflow, but should be separate to the Git repository.
- Yes, our workflow release wouldn’t be complete without the workflow definition.
- No, this directory contains files that are specific to your computer, so aren’t useful to anyone else.
- No, this will change from run to run, and is not part of the software. A reader can regenerate it from your code and data, by using the workflow.
- Yes, while this is input to the workflow, it does not change based on the input data, and isn’t generated from the physics, so forms part of the workflow itself.
- Yes, this is essential for a reader to be able to understand how to use the workflow, so should be part of the repository.
- No, this is private information that may allow others to take over or abuse your supercomputing accounts. Be very careful not to put private information in public Git repositories!
README
In general, we would like others to be able to reproduce our
work—this is a key aspect of the scientific method. To this end, it’s
important to let them know how to use our workflow, to minimise the
amount of effort that must be spent getting it running. By convention,
this is done in a file called README (with an appropriate
file extension).
In this context, it’s good to remember that “other people” includes “your collaborators” and “yourself in six months’ time”, so writing a good README isn’t just good citizenship to help others, it’s also directly beneficial to you and those you work with.
One good format to write a README in is Markdown.
This is rendered to formatted text automatically by most Git hosting
services, including GitHub. When using this, the file is conventionally
called README.md.
Things you should include in your README include:
- A brief statement of what the workflow does, including a link to the paper it was written for.
- A listing of what software is required, including links for installation instructions or downloads. (Snakemake is one such piece of software.)
- Instructions on downloading the repository and necessary data, including a link to where the data can be found.
- Instructions on how to run the workflow, including the suggested
snakemake invocation.
- An indication of the time required to run the workflow, and the hardware that that estimate applies to.
- Details of where output from the workflow is placed.
- An indication of how reusable the workflow is: is it intended to work with any set of input data, and has it been tested for this, or is it designed for the specific dataset being analysed?
Rather than writing from scratch, you may wish to work from a template for this. One suitable README template can be found in the TELOS Collaboration’s workflow template.
Write a README
Use the TELOS
Collaboration template to draft a file README.md for
the workflow we have developed.
- Use .smk files in workflow/rules to compartmentalise the Snakefile, and use include: lines in the main Snakefile to link them into the workflow.
- Add a rule at the top of the Snakefile with default_target: True to specify the default output of a workflow.
- Use .gitignore to avoid committing input or output data or the Snakemake cache.
- Use .git_keep files to preserve empty directories.
- Use Git submodules to link to libraries you have written that aren’t on PyPI.
- Include a README.md in your repository explaining how to run the workflow.