Summary and Schedule
Reliably running a range of tools, across multiple inputs and in the right order, is a very common problem in computing, including in scientific data analysis. If you develop data analysis scripts from scratch, you will usually run up against these challenges. Rather than reinventing the wheel, we can lean on existing tools to do this for us, letting us focus on the aspects unique to our own work. Such tools are called workflow managers.
In this lesson, we introduce Snakemake, a workflow manager that was originally developed for bioinformatics applications but maps well onto the needs of data analysis in lattice quantum field theory.
By defining rules, each of which specifies how to translate one or more input files into one or more output files, we can build up a workflow that takes our raw data as input, and produces plots, tables, and other definitions that we can include in our publications.
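As a rough illustration of what a rule looks like, the sketch below writes a one-rule Snakefile from the shell. The rule name `count_lines` and both file paths are hypothetical and are not part of the lesson data; the rules used in the lesson itself are built up in the episodes.

```shell
# Write a minimal Snakefile (rule name and file paths are hypothetical).
cat > Snakefile <<'EOF'
# A rule states how to turn its input file(s) into its output file(s).
rule count_lines:
    input: "raw/data.txt"
    output: "results/line_count.txt"
    shell: "wc -l {input} > {output}"
EOF
```

Once Snakemake is installed and `raw/data.txt` exists, running `snakemake --cores 1 results/line_count.txt` would ask Snakemake to produce that output file using this rule.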
Acknowledgments
The development of this lesson was supported by the STFC Research Software Engineer Fellowship EP/V052489/1. The data presented were generated using the Supercomputing Wales SUNBIRD system. Supercomputing Wales is supported by the European Regional Development Fund via Welsh Government.
The author is grateful to Alexis Provatas, Gaurav Ray, and Gianmarco Simonetti for testing and giving feedback on early versions of this lesson.
| Setup Instructions | Download files required for the lesson | |
| Duration: 00h 00m | 1. Running commands with Snakemake | How do I run a simple command with Snakemake? | 
| Duration: 00h 25m | 2. Running Python code with Snakemake | How do I configure an environment to run Python with Snakemake? | 
| Duration: 00h 50m | 3. Placeholders and wildcards | How do I make a generic rule? How does Snakemake decide what rule to run? | 
| Duration: 01h 25m | 4. Chaining rules | How do I combine rules into a workflow? How can I make a rule with multiple input files? How should I refer to multiple files with similar names? | 
| Duration: 02h 15m | 5. Metadata and parameters | How do I specify and configure parameters my whole workflow relies on? How do I set up parameters for individual jobs? | 
| Duration: 03h 00m | 6. Multiple inputs and outputs | How do I write rules that require or use more than one file, or class of file? How do I tell Snakemake to not delete log files when jobs fail? What do Snakemake errors look like, and how do I read them? | 
| Duration: 03h 35m | 7. How Snakemake plans jobs | How do I visualise a Snakemake workflow? How does Snakemake avoid unnecessary work? How do I control what steps will be run? | 
| Duration: 03h 55m | 8. Optimising workflow performance | What compute resources are available on my system? How do I define jobs with more than one thread? How do I measure the compute resources being used by a workflow? How do I run my workflow steps in parallel? | 
| Duration: 04h 25m | 9. Awkward corners | How can I look up metadata based on wildcard values? How can I select different numbers of input files depending on wildcard values? How can I tell Snakemake not to regenerate a file? | 
| Duration: 05h 20m | 10. Tidying up | How do I split a Snakefile into manageable pieces? How do I avoid needing to list every file to generate in my snakemake call? What should I bear in mind when using Git for my Snakemake workflow? What should I include in the README for my workflow? | 
| Duration: 05h 55m | 11. Publishing your workflow | How do I verify that my workflow is ready to upload? How do I prepare a single archive of my workflow and its dependencies? | 
| Duration: 06h 10m | Finish | 
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Software installation
These instructions set out how to obtain and install the software and data on Linux or macOS. It is assumed that you have:
- access to the Bash or Zsh shell on a fairly modern Linux or macOS system
- sufficient disk space (~1GB) to store the software and data
You do not need root/administrator access.
Data Sets
Download the data zip file and unzip it to your Desktop.
Software Setup
Conda
We will use Conda both to install Snakemake itself, and to manage dependencies of our workflows. Miniforge provides a minimal Conda environment, on which we will build.
Details
Download the correct file for your operating system from the Miniforge repository, and execute it at the terminal.
This lesson has not been designed to run on Windows. We would recommend using the Windows Subsystem for Linux, and following the instructions for Linux.
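On Linux or macOS, the installer name can be worked out from the shell, as sketched below. The URL pattern follows the Miniforge README at the time of writing and is an assumption here; check the Miniforge repository if it has changed.

```shell
# Build the installer file name for this machine from its OS and architecture.
installer="Miniforge3-$(uname)-$(uname -m).sh"

# Assumed download location, following the pattern in the Miniforge README.
url="https://github.com/conda-forge/miniforge/releases/latest/download/${installer}"

echo "Installer: ${installer}"
echo "URL: ${url}"
```

With these variables set, `curl -L -O "$url"` downloads the installer and `bash "$installer"` runs it. The installer asks where to install, so root access is not needed.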
Update Conda
If you already have Conda installed (for example, as part of Anaconda), we recommend updating to the latest version, since Snakemake relies on some features introduced in recent versions.
To do this, run:
conda update --name base --upgrade conda
If you’re not able to update the base environment, you may need to create a fresh Conda installation using the instructions above.
Snakemake
With Conda available, we can create an environment containing Snakemake and its dependencies. This can be used not just for this lesson, but for your work in Snakemake going forward.
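One way to do this is with an environment file, sketched below. The file name `snakemake-env.yml`, the environment name `snakemake`, and the unpinned `snakemake` package are assumptions for illustration; your instructor may ask for specific versions. The `conda-forge` and `bioconda` channels are the ones the Snakemake documentation recommends installing from.

```shell
# Write a minimal Conda environment file (file and environment names are assumptions).
cat > snakemake-env.yml <<'EOF'
name: snakemake
channels:
  - conda-forge
  - bioconda
dependencies:
  - snakemake
EOF
```

The environment can then be created with `conda env create --file snakemake-env.yml` and activated with `conda activate snakemake`.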
LaTeX
We will be using Matplotlib to generate plots formatted with LaTeX, which relies on having LaTeX installed.
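A quick way to see whether LaTeX is already available is to check for the relevant programs on your `PATH`, as in the sketch below. The helper name `check_tool` is hypothetical. `dvipng` is included because Matplotlib needs it for LaTeX text in raster output such as PNG; it may not be required if you only produce PDF plots.

```shell
# Report whether a given program can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found"
    fi
}

# Check the tools Matplotlib calls when rendering text with LaTeX.
for tool in latex dvipng; do
    check_tool "$tool"
done
```

If either tool is reported as not found, install a LaTeX distribution before starting the plotting episodes.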
This lesson has not been tested with Windows. You may try using MiKTeX.