Summary and Schedule

Reliably running a range of tools across multiple inputs, in the right order, is a very common problem in computing, including in scientific data analysis. If you develop data analysis scripts from scratch, you will usually run up against these challenges. Rather than reinventing the wheel, we can lean on existing tools to do this for us, letting us focus on the aspects unique to our own work. Such tools are called workflow managers.

In this lesson, we introduce Snakemake, a workflow manager that was originally developed for bioinformatics applications but maps well onto the needs of data analysis in lattice quantum field theory.

By defining rules, each of which specifies how to translate one or more input files into one or more output files, we can build up a workflow that takes our raw data as input, and produces plots, tables, and other definitions that we can include in our publications.
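As a rough illustration of what a rule looks like, the sketch below writes a one-rule Snakefile from the shell. The rule name `count_lines` and the file paths are hypothetical and are not part of the lesson's data; the lesson itself builds up real rules from the first episode onwards.

```shell
# Work in a scratch directory so none of your own files are touched.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# A hypothetical input file, standing in for raw data.
mkdir -p data
printf 'first\nsecond\nthird\n' > data/raw.txt

# A Snakefile with a single rule: it names the file that is needed (input),
# the file that is produced (output), and the command that turns one into
# the other (shell).
cat > Snakefile <<'EOF'
rule count_lines:
    input: "data/raw.txt"
    output: "results/line_count.txt"
    shell: "wc -l < {input} > {output}"
EOF

# If Snakemake is already installed, ask it to produce the output file.
if command -v snakemake >/dev/null 2>&1; then
    snakemake --cores 1 results/line_count.txt
else
    echo "snakemake not found; see the software installation section"
fi
```

Note that we ask Snakemake for the output file we want, rather than naming the rule's command directly; Snakemake works out which rule can produce it.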

Acknowledgments

The development of this lesson was supported by the STFC Research Software Engineer Fellowship EP/V052489/1. The data presented were generated using the Supercomputing Wales SUNBIRD system. Supercomputing Wales is supported by the European Regional Development Fund via Welsh Government.

The author is grateful to Alexis Provatas, Gaurav Ray, and Gianmarco Simonetti for testing and giving feedback on early versions of this lesson.

Setup Instructions
Download files required for the lesson

Duration: 00h 00m
1. Running commands with Snakemake
How do I run a simple command with Snakemake?

Duration: 00h 25m
2. Running Python code with Snakemake
How do I configure an environment to run Python with Snakemake?

Duration: 00h 50m
3. Placeholders and wildcards
How do I make a generic rule?
How does Snakemake decide what rule to run?

Duration: 01h 25m
4. Chaining rules
How do I combine rules into a workflow?
How can I make a rule with multiple input files?
How should I refer to multiple files with similar names?

Duration: 02h 15m
5. Metadata and parameters
How do I specify and configure parameters my whole workflow relies on?
How do I set up parameters for individual jobs?

Duration: 03h 00m
6. Multiple inputs and outputs
How do I write rules that require or use more than one file, or class of file?
How do I tell Snakemake to not delete log files when jobs fail?
What do Snakemake errors look like, and how do I read them?

Duration: 03h 35m
7. How Snakemake plans jobs
How do I visualise a Snakemake workflow?
How does Snakemake avoid unnecessary work?
How do I control what steps will be run?

Duration: 03h 55m
8. Optimising workflow performance
What compute resources are available on my system?
How do I define jobs with more than one thread?
How do I measure the compute resources being used by a workflow?
How do I run my workflow steps in parallel?

Duration: 04h 25m
9. Awkward corners
How can I look up metadata based on wildcard values?
How can I select different numbers of input files depending on wildcard values?
How can I tell Snakemake not to regenerate a file?

Duration: 05h 20m
10. Tidying up
How do I split a Snakefile into manageable pieces?
How do I avoid needing to list every file to generate in my snakemake call?
What should I bear in mind when using Git for my Snakemake workflow?
What should I include in the README for my workflow?

Duration: 05h 55m
11. Publishing your workflow
How do I verify that my workflow is ready to upload?
How do I prepare a single archive of my workflow and its dependencies?

Duration: 06h 10m
Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Software installation

These instructions set out how to obtain and install the software and data on Linux or macOS. It is assumed that you have:

access to the Bash or Zsh shell on a fairly modern Linux or macOS system
sufficient disk space (~1GB) to store the software and data
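If you are unsure whether your machine meets these assumptions, a quick check along the following lines may help. This is only a sketch: the helper `free_kb` is a hypothetical name introduced here, and the 1GB threshold is simply the approximate figure quoted above.

```shell
# Report the operating system; the lesson expects Linux or Darwin (macOS).
os_name="$(uname -s)"
echo "Operating system: $os_name"

# Report the login shell; the lesson expects Bash or Zsh.
echo "Login shell: ${SHELL:-unknown}"

# Free space, in kilobytes, on the filesystem holding a given directory.
# df -P gives POSIX-format output: line 2, column 4 is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

required_kb=$((1024 * 1024))   # roughly 1GB
target_dir="${HOME:-/}"
available_kb="$(free_kb "$target_dir")"

if [ "$available_kb" -ge "$required_kb" ]; then
    echo "Disk space: OK (${available_kb} kB free in $target_dir)"
else
    echo "Disk space: only ${available_kb} kB free in $target_dir"
fi
```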

You do not need root/administrator access.

Data Sets

Download the data zip file and unzip it to your Desktop.

Software Setup

Conda

We will use Conda both to install Snakemake itself, and to manage dependencies of our workflows. Miniforge provides a minimal Conda environment, on which we will build.

Download the correct file for your operating system from the Miniforge repository, and execute it at the terminal.

Windows

This lesson has not been designed to run on Windows. We recommend using the Windows Subsystem for Linux, and following the instructions for Linux.

macOS, Linux

BASH

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

When prompted, unless you have a reason not to, pick the option for Conda to set up your environment with conda init.
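The download command above builds the installer's file name from the output of `uname`. If you want to see which file it will fetch on your machine before running it, a small sketch like this prints the name without downloading anything (the variable name `installer` is our own choice):

```shell
# uname prints the kernel name (for example Linux or Darwin);
# uname -m prints the machine architecture (for example x86_64 or arm64).
installer="Miniforge3-$(uname)-$(uname -m).sh"
echo "Installer file name on this machine: $installer"
```

On a typical 64-bit Intel or AMD Linux machine, this would be expected to give `Miniforge3-Linux-x86_64.sh`.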

Update Conda

If you already have Conda installed (for example, as part of Anaconda), we recommend updating to the latest version, since Snakemake relies on some features introduced in recent versions.

To do this, run

conda update --name base conda

If you’re not able to update the base environment, you may need to create a fresh Conda installation using the instructions above.

Snakemake

With Conda available, we can create an environment containing Snakemake and its dependencies. This can be used not just for this lesson, but for your work in Snakemake going forward.

BASH

conda create -n snakemake -c conda-forge -c bioconda snakemake
conda activate snakemake

After starting a new terminal, or rebooting your computer, you will need to run

BASH

conda activate snakemake

to activate the environment before you can use Snakemake.

LaTeX

We will be using Matplotlib to generate plots formatted with LaTeX, which relies on having LaTeX installed.

Windows

This lesson has not been tested with Windows. You may try using MiKTeX.

macOS, Linux

Follow the instructions on TeX Live’s website.
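Once everything is installed, you may want to confirm that the tools are visible from your shell. The following is a minimal sketch rather than an official check: `check_tool` is a hypothetical helper defined here, and `snakemake` will only be found after you have run `conda activate snakemake` in the current terminal.

```shell
# Print where a tool lives, or a warning if it is not on the PATH.
# Returns 0 when the tool is found and 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1 ($(command -v "$1"))"
    else
        echo "MISSING: $1"
        return 1
    fi
}

# Count how many of the three tools used in this lesson are missing.
missing=0
for tool in conda snakemake latex; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing of 3 tools missing"
```

If any tool is reported as missing, revisit the corresponding section above before starting the lesson.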