Summary and Setup
Reliably running a range of tools, across multiple inputs and in the right order, is a very common problem in computing, including in scientific data analysis. If you develop data analysis scripts from scratch, you will usually run up against these challenges. Rather than reinventing the wheel, we can lean on existing tools to do this for us, letting us focus on the aspects unique to our own work. Such tools are called workflow managers.
In this lesson, we introduce Snakemake, a workflow manager that was originally developed for bioinformatics applications but also maps well onto the needs of data analysis in lattice quantum field theory.
By defining rules, each of which specifies how to turn one or more input files into one or more output files, we can build up a workflow that takes our raw data as input and produces plots, tables, and other definitions that we can include in our publications.
Acknowledgments
The development of this lesson was supported by the STFC Research Software Engineer Fellowship EP/V052489/1. The data presented were generated using the Supercomputing Wales SUNBIRD system. Supercomputing Wales is supported by the European Regional Development Fund via Welsh Government.
The author is grateful to Alexis Provatas, Gaurav Ray, and Gianmarco Simonetti for testing and giving feedback on early versions of this lesson.
Software installation
These instructions set out how to obtain and install the software and data on Linux or macOS. It is assumed that you have:
- access to the Bash or Zsh shell on a fairly modern Linux or macOS system
- sufficient disk space (~1 GB) to store the software and data
You do not need root/administrator access.
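If you want to confirm these assumptions before continuing, a quick check along the following lines may help. This is only a sketch: the 1 GB threshold mirrors the estimate above, and the variable name `avail_kb` is purely illustrative.

```shell
# Show which shell is configured as your login shell.
echo "Shell: ${SHELL:-unknown}"

# df -P gives portable (POSIX) output; with -k, column 4 of the second
# line is the available space in 1 KiB blocks.
avail_kb=$(df -Pk "${HOME:-.}" | awk 'NR==2 {print $4}')

# 1048576 KiB is 1 GiB, matching the rough estimate given above.
if [ "$avail_kb" -ge 1048576 ]; then
    echo "At least 1 GB free in your home directory"
else
    echo "Less than 1 GB free in your home directory"
fi
```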
Data Sets
Download the data zip file and unzip it to your Desktop.
Software Setup
Conda
We will use Conda both to install Snakemake itself and to manage the dependencies of our workflows. Miniforge provides a minimal Conda environment, on which we will build.
Download the correct file for your operating system from the Miniforge repository, and execute it at the terminal.
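As a sketch of what this might look like at the terminal: the URL pattern below follows the one given in the Miniforge README at the time of writing, so treat it as an assumption and check it against the repository. The function name `install_miniforge` is purely illustrative.

```shell
# Build the installer name for this system, e.g. Miniforge3-Linux-x86_64.sh
installer="Miniforge3-$(uname)-$(uname -m).sh"
url="https://github.com/conda-forge/miniforge/releases/latest/download/${installer}"
echo "Installer URL: ${url}"

# Download the installer, then run it and follow the prompts.
# Wrapped in a function so that nothing is downloaded until you call it.
install_miniforge() {
    curl -L -O "$url" && bash "$installer"
}
```

Once you are happy that the printed URL matches your system, run `install_miniforge` at the prompt. You may need to open a new terminal afterwards before the `conda` command becomes available.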
This lesson has not been designed to run on Windows. We recommend using the Windows Subsystem for Linux and following the instructions for Linux.
Update Conda
If you already have Conda installed (for example, as part of Anaconda), we recommend updating to the latest version, since Snakemake relies on some features introduced in recent versions.
To do this, run
conda update --name base conda
If you’re not able to update the base environment, you may need to create a fresh Conda installation using the instructions above.
Snakemake
With Conda available, we can create an environment containing Snakemake and its dependencies. This can be used not just for this lesson, but for your work in Snakemake going forward.
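One way to do this is sketched below. The environment name `snakemake` and the choice of the `conda-forge` and `bioconda` channels are assumptions made for illustration; if the lesson materials provide an environment file, prefer that.

```shell
# Hypothetical environment name; choose any name you like.
env_name="snakemake"

if command -v conda >/dev/null 2>&1; then
    # Create a new environment containing Snakemake and its dependencies.
    # --yes skips the confirmation prompt.
    conda create --yes --name "$env_name" \
        --channel conda-forge --channel bioconda snakemake
    echo "Now run: conda activate $env_name"
else
    echo "conda not found; install Miniforge first" >&2
fi
```

After activating the environment with `conda activate snakemake`, running `snakemake --version` should print a version number if the installation succeeded.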
LaTeX
We will be using Matplotlib to generate plots formatted with LaTeX, which relies on having LaTeX installed.
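To see whether a suitable LaTeX installation is already present, a check like the following may be useful. It looks for `latex` and for `dvipng`, which Matplotlib additionally needs when rendering LaTeX text with its Agg-based backends. The variable name `missing` is illustrative only.

```shell
# Count how many of the required tools cannot be found on the PATH.
missing=0
for tool in latex dvipng; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found"
        missing=$((missing + 1))
    fi
done
echo "Missing tools: $missing"
```

If either tool is reported as not found, install a LaTeX distribution through your system's package manager before continuing.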
This lesson has not been tested on Windows. You may try using MiKTeX.