Tidying up
Last updated on 2025-10-29 | Edit this page
Estimated time: 35 minutes
Overview
Questions
- How do I split a Snakefile into manageable pieces?
- How do I avoid needing to list every file to generate in my
snakemake call?
- What should I bear in mind when using Git for my Snakemake workflow?
- What should I include in the README for my workflow?
Objectives
- Be able to avoid having long monolithic Snakefiles
- Be able to have Snakemake generate a set of targets without explicitly specifying them
- Understand how to effectively use Git to version control Snakemake workflows
- Be able to write an effective README on how to use a workflow
Breaking up the Snakefile
So far, we have written all of our rules into a single long file
called Snakefile. As we continue to add more rules, this
can start to get unwieldy, and we might want to break it up into smaller
pieces.
Let’s do this now. We can take the rules relating only to the
pg output files, and place them into a new file
workflow/rules/pg.smk. Since conda: directives
are defined relative to the current file, we need to replace
envs/analysis.yml with ../envs/analysis.yml
when we do this.
In place of the rules we just moved, in the Snakefile,
we add the line
include: "rules/pg.smk"

This tells Snakemake to take the contents of the new file we created,
and place it at that point in the file. Unlike Python, where the
import statement creates a new scope, in
Snakemake, anything defined above the include line is
available to the included code. So we are safe to use the configuration
parameters and metadata that we load at the top of the file.
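After the move, workflow/rules/pg.smk might look something like the sketch below; the input, output, and shell lines are illustrative placeholders rather than the lesson’s actual rule, and the point is the adjusted conda: path.

```snakemake
# workflow/rules/pg.smk -- sketch; paths and the shell command are placeholders
rule avg_plaquette:
    input:
        "raw_data/{ensemble}/out_pg"
    output:
        "intermediary_data/{ensemble}/avg_plaquette.json"
    conda:
        # one directory level up from rules/, hence the ../
        "../envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.plaquette {input} > {output}"
```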
Clean out the Snakefile
Sort the remaining rules in the Snakefile into
additional .smk files. Place these in the
rules subdirectory with pg.smk.
One possible breakdown to use would be
- Rules relating to correlation function fits
- Rules relating to the Wilson flow
- Rules combining output of more than one ensemble
Hint: conda: paths are defined relative to the
source file, so they will need to be adjusted with a
../ prefix.
One option is to have four files in workflow/rules:
- pg.smk, containing the rules count_trajectories and avg_plaquette
- corr_fits.smk, containing the rules meson_mass and one_loop_matching
- wflow.smk, containing the rule w0
- output.smk, containing the rules plot_avg_plaquette, tabulate_counts, restricted_spectrum, spectrum, and quick_spectrum.
In each one, any conda: directives should point to
../envs/analysis.yml.
Tidy up the plot_styles
You might have noticed that we’re using the plot_styles
configuration option as an argument to the plotting rules, but without
including it in the input blocks. Since the argument is a
filename, it is a good idea to let Snakemake know about this, so that if
we adjust the style file, Snakemake knows to re-run the plotting
rules.
Make that change now for all rules depending on
plot_styles.
The input: block should contain a new line similar
to:
        plot_styles=config["plot_styles"],

The shell: block should then replace references to
{config[plot_styles]} with
{input.plot_styles}.
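For example, a plotting rule might end up looking like this (the data input and script name here are invented for illustration; only the plot_styles lines reflect the change):

```snakemake
rule plot_avg_plaquette:
    input:
        # hypothetical data dependency; the new line is plot_styles
        data="intermediary_data/avg_plaquette.csv",
        plot_styles=config["plot_styles"],
    output:
        f"assets/plots/plaquette_scan{config['plot_filetype']}"
    shell:
        "python plot.py {input.data} --styles {input.plot_styles} --output {output}"
```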
Use the metadata
Currently, many of our rules explicitly list the same values of
beta. Can you tidy this up so that they use the values from
the metadata file instead?
Each instance of beta= inside an expand may be replaced
with:

beta=sorted(set(metadata["beta"]))

Alternatively, you may define this as a variable at the top of the file, and use it by name, rather than repeating it every time.
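For instance, near the top of the Snakefile (the variable name BETAS and the path in the comment are our own choices):

```snakemake
# After the metadata file has been loaded:
BETAS = sorted(set(metadata["beta"]))

# Then, in any rule that previously listed the values explicitly:
#     expand("intermediary_data/pg/b{beta}/avg_plaquette.json", beta=BETAS)
```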
For the spectrum_scaled rule, you will additionally need
to filter the metadata, for example via
beta=sorted(set(metadata[metadata["beta"] < 1.9]["beta"]))

A default target
When we come to run the full workflow to generate all of the assets for our publication, it is frustrating to need to list every single file we would like. It is much better if we can do this as part of the workflow, and ask Snakemake to generate a default set of targets.
We can do this by adding the following rule above all others in our
Snakefile:
tables = expand("assets/tables/{table}.tex", table=["counts"])
plots = expand(
    f"assets/plots/{{plot}}{config['plot_filetype']}",
    plot=["plaquette_scan", "spectrum"],
)
rule all:
    input:
        plots=plots,
        tables=tables,
    default_target: True

Unlike the rules we looked at in the previous section, this one should stay in the main Snakefile. Note that the rule only has inputs: Snakemake sees that those files must be present for the rule to complete, so runs the rules necessary to generate them. When we call Snakemake with no targets, as

snakemake --cores all --use-conda

then Snakemake will look to the all rule, and make
assets/tables/counts.tex,
assets/plots/plaquette_scan.pdf, and
assets/plots/spectrum.pdf, along with any intermediary
files they depend on.
It can also be a good idea to add a provenance stamp at this point, where you create an additional data file listing all outputs the workflow generated, along with hashes of their contents, to make it more obvious if any files left over from previous workflow runs sneak into the output.
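A minimal sketch of such a stamp, assuming we just want filename-to-digest pairs written as JSON (the helper names and file format here are our own invention, not a Snakemake feature):

```python
import hashlib
import json
from pathlib import Path


def provenance_stamp(paths):
    """Map each output file to the SHA-256 digest of its contents."""
    return {
        str(path): hashlib.sha256(Path(path).read_bytes()).hexdigest()
        for path in paths
    }


def write_stamp(paths, stamp_file="provenance.json"):
    """Write the stamp alongside the other workflow outputs."""
    Path(stamp_file).write_text(json.dumps(provenance_stamp(paths), indent=2))
```

This could be called from a run: block in the all rule, passing it the rule’s inputs; any leftover file whose recorded digest does not match the stamp was not produced by the current run.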
Using Snakemake with Git
Using a version control system such as Git is good practice when developing software of any kind, including analysis workflows. There are some steps that we can take to make our workflows fit more nicely into Git.
Firstly, we’d like to avoid committing files to the repository that
aren’t part of the workflow definition. We make use of the file
.gitignore in the repository root for this. In general, our
Git repository should only contain things that should not change from
run to run, such as the workflow definition and any workflow-specific
code. Some examples of things that probably shouldn’t be committed
include:
- Input data and metadata. These should live and be shared separately from the workflow, as they are inputs to the analysis rather than a dedicated part of the analysis workflow. This means that someone wanting to read the code doesn’t need to download gigabytes of unwanted data.
- Intermediary and output files. You may wish to share these with readers, and the output plots and tables should be included in your papers. But if a reader wants these from the workflow, they should download the input data and workflow, and run the latter. Mixing generated files with the repository will confuse matters, and make it so that any workflow re-runs create unwanted “uncommitted changes” that Git will notify you about.
- .snakemake directory. Similarly to the intermediary and output files, this will change from machine to machine and should not be committed. (In particular, it will get quite large with Conda environments, which are only useful on the computer they were created on.)
If you quote data from another paper, and have had to transcribe this into a machine-readable format like CSV yourself, then this may be included in the repository, since it is not a data asset that you can reasonably publish in a data repository under your own name. In this case, the data’s provenance and attribution should be clearly stated.
Let’s check the .gitignore for our workflow now.
$ cat .gitignore
# Common temporary files
thumbs.db
.DS_Store
# Python compiled objects
*.pyc
__pycache__/
.mypy_cache/
# Snakemake cache
.snakemake/
# Workflow outputs
assets/
data_assets/
intermediary_data/
# Workflow inputs that should not be held together with code
raw_data/*
metadata/*
# Common editor temporary files
*~
.#*#

You can see that all of the above are being ignored, as are some
common files to sneak into Git repositories—temporary files generated by
text editors and cache files from operating systems, for example. This
forms a good basic template for your own .gitignore files
for Snakemake workflows; you may want to expand yours based on GitHub’s templates, for
example.
Someone downloading your workflow will need to place the input data
and metadata in the correct locations. You might wish to prepare an
empty directory for the reader to place the necessary files into. Since
Git does not track empty directories, only files, we create an empty
file in it, and update the .gitignore rules to un-ignore
only that file.
touch metadata/.git_keep
nano .gitignore
# Don't ignore placeholder files for would-be empty directories
!*/.git_keep

Let’s commit these changes now:
git add .gitignore
git add metadata/.git_keep
git commit -m "Prepare empty directory for metadata"

Where you use libraries that you have written that are not on the
Python Package Index, it is a good idea to
incorporate these as Git submodules in your workflow. While
pip can install packages from GitHub repositories, this is
not robust against you moving or closing your GitHub account later, or
GitHub ceasing to offer free services. Instead, you can make available
a ZIP file containing the workflow and its primary dependencies, which
will continue to be installable significantly further into the
future.
Currently we have a directory libs/su2pg_analysis,
containing the library we have used for most of this analysis. This
library is hosted on GitHub at
https://github.com/edbennett/su2pg_analysis. To track this
as a Git submodule, we use the command
$ git submodule add https://github.com/edbennett/su2pg_analysis libs/su2pg_analysis
Adding existing repo at 'libs/su2pg_analysis' to the index
$ git commit -m "Add su2pg_analysis as submodule"
[main 89c3d3d] Add su2pg_analysis as submodule
 2 files changed, 4 insertions(+)
 create mode 160000 libs/su2pg_analysis

We also need to commit the work we’ve done on the workflow itself. (We’ve put this off while we’ve been learning, but in general it’s a good idea to commit regularly during development, not only when you’ve “finished”.)
git add workflow/ config/
git status

If we’re happy with what is to be committed, we can then run

git commit -m "first draft of complete workflow"

What to include?
Which of the following files should be included in a hypothetical workflow repository?
- sort.py, a tool for putting results in the correct order to show in a paper.
- out_corr, a correlation function output file.
- Snakefile, the workflow definition.
- .snakemake/, the Snakemake cache directory.
- spectrum.pdf, a plot created by the workflow.
- prd.mplstyle, a Matplotlib style file used by the workflow.
- README.md, guidelines on how to use the workflow.
- id_rsa, the SSH private key used to connect to clusters to run the workflow.
- Yes, this tool is clearly part of the workflow, so should be included in the repository, unless it’s being used from an external library.
- No, this is input data, so is not part of the software. This should be part of a data release, as otherwise a reader won’t be able to run the workflow, but should be separate to the Git repository.
- Yes, our workflow release wouldn’t be complete without the workflow definition.
- No, this directory contains files that are specific to your computer, so aren’t useful to anyone else.
- No, this will change from run to run, and is not part of the software. A reader can regenerate it from your code and data, by using the workflow.
- Yes, while this is input to the workflow, it does not change based on the input data, and isn’t generated from the physics, so forms part of the workflow itself.
- Yes, this is essential for a reader to be able to understand how to use the workflow, so should be part of the repository.
- No, this is private information that may allow others to take over or abuse your supercomputing accounts. Be very careful not to put private information in public Git repositories!
README
In general, we would like others to be able to reproduce our
work—this is a key aspect of the scientific method. To this end, it’s
important to let them know how to use our workflow, to minimise the
amount of effort that must be spent getting it running. By convention,
this is done in a file called README (with an appropriate
file extension).
In this context, it’s good to remember that “other people” includes “your collaborators” and “yourself in six months’ time”, so writing a good README isn’t just good citizenship to help others, it’s also directly beneficial to you and those you work with.
One good format to write a README in is Markdown.
This is rendered to formatted text automatically by most Git hosting
services, including GitHub. When using this, the file is conventionally
called README.md.
Things you should include in your README include:
- A brief statement of what the workflow does, including a link to the paper it was written for.
- A listing of what software is required, including links for installation instructions or downloads. (Snakemake is one such piece of software.)
- Instructions on downloading the repository and necessary data, including a link to where the data can be found.
- Instructions on how to run the workflow, including the suggested
snakemake invocation.
- An indication of the time required to run the workflow, and the hardware that that estimate applies to.
- Details of where output from the workflow is placed.
- An indication of how reusable the workflow is: is it intended to work with any set of input data, and has it been tested for this, or is it designed for the specific dataset being analysed?
Rather than writing from scratch, you may wish to work from a template for this. One suitable README template can be found in the TELOS Collaboration’s workflow template.
Write a README
Use the TELOS
Collaboration template to draft a file README.md for
the workflow we have developed.
- Use .smk files in workflow/rules to compartmentalise the Snakefile, and use include: lines in the main Snakefile to link them into the workflow.
- Add a rule at the top of the Snakefile with default_target: True to specify the default output of a workflow.
- Use .gitignore to avoid committing input or output data or the Snakemake cache.
- Use .git_keep files to preserve empty directories.
- Use Git submodules to link to libraries you have written that aren’t on PyPI.
- Include a README.md in your repository explaining how to run the workflow.