Running commands with Snakemake
- Before running Snakemake you need to write a Snakefile
- A Snakefile is a text file which defines a list of rules
- Rules have inputs, outputs, and shell commands to be run
- You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule
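For example, a minimal Snakefile might contain a single rule like this (the file names and command are illustrative):

```
# Count the lines in a data file and save the result
rule count_lines:
    input: "data/reads.txt"
    output: "results/line_count.txt"
    shell:
        "wc -l {input} > {output}"
```

Asking for the output file, e.g. `snakemake --cores 1 results/line_count.txt`, makes Snakemake run the `wc` command from this rule.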
Running Python code with Snakemake
- Snakemake will manage Conda environments for you, to help ensure that workflows always use a consistent set of packages
- Use the `--use-conda` option to `snakemake` to enable this behaviour
- Use `conda:` to specify a Conda environment definition (`.yml`) file. The path of this is relative to the file in which it is specified.
- Conda environment files are conventionally put in the `workflow/envs` directory
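As a sketch, the environment file and the rule that uses it might look like this (the package list, script, and file names are illustrative):

```
# workflow/envs/plotting.yml
channels:
  - conda-forge
dependencies:
  - python=3.11
  - matplotlib
```

```
# In workflow/Snakefile -- the conda: path is relative to this file
rule make_plot:
    input: "results/summary.csv"
    output: "results/plot.png"
    conda: "envs/plotting.yml"
    shell:
        "python scripts/plot.py {input} {output}"
```

Run with `snakemake --use-conda ...` so that Snakemake creates and activates the environment before the job runs.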
Placeholders and wildcards
- Snakemake rules are made generic with placeholders and wildcards
- Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
- Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
- Snakemake checks for various error conditions and will stop if it sees a problem
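As an illustration, `{sample}` below is a wildcard, while `{input}` and `{output}` are placeholders in the shell command (file names are made up):

```
# Requesting results/A.sorted.txt matches this rule with sample="A"
rule sort_file:
    input: "data/{sample}.txt"
    output: "results/{sample}.sorted.txt"
    shell:
        "sort {input} > {output}"
```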
Chaining rules
- Snakemake links up rules by iteratively looking for rules that make missing inputs
- Careful choice of filenames allows this to work
- Rules may have multiple named input files (and output files)
- Use `expand()` to generate lists of filenames from a template
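A sketch of chaining with `expand()` (sample names and commands are illustrative):

```
SAMPLES = ["A", "B"]

rule all_results:
    input:
        expand("results/{sample}.sorted.txt", sample=SAMPLES)

# The sorted file needs the trimmed file, which is made from the raw
# data file, so Snakemake chains these rules automatically.
rule trim:
    input: "data/{sample}.txt"
    output: "trimmed/{sample}.txt"
    shell:
        "head -n 100 {input} > {output}"

rule sort_file:
    input: "trimmed/{sample}.txt"
    output: "results/{sample}.sorted.txt"
    shell:
        "sort {input} > {output}"
```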
Metadata and parameters
- Use a YAML file to define parameters for the workflow, and attach it using `configfile:` near the top of the file.
- Override individual options at run-time with the `--config` option.
- Load additional parameter files at run-time using the `--configfile` option.
- Load ensemble-specific metadata from a CSV file into a Pandas dataframe.
- Use `lookup()` to get information out of the dataframe in a rule.
- Use `params:` to define job-specific parameters that do not describe filenames.
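A minimal sketch of these ideas, assuming an illustrative `config.yaml` containing `cutoff: 10` and a `samples.csv` with `sample` and `condition` columns. The `lookup()` helper comes from recent Snakemake releases, and the exact arguments shown here (`query=`, `within=`, `cols=`) should be checked against the documentation for your version; `summarise.py` is a hypothetical script.

```
configfile: "config.yaml"

import pandas as pd
samples = pd.read_csv("samples.csv")

rule summarise:
    input: "data/{sample}.txt"
    output: "results/{sample}.summary.txt"
    params:
        cutoff=config["cutoff"],
        # Pull this sample's condition out of the dataframe
        condition=lookup(query="sample == '{sample}'", within=samples, cols="condition"),
    shell:
        "python scripts/summarise.py --cutoff {params.cutoff}"
        " --condition {params.condition} {input} > {output}"
```

At run time, `snakemake --config cutoff=20 ...` would override the YAML value, and `--configfile extra.yaml` would layer on another parameter file.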
Multiple inputs and outputs
- Rules can have multiple inputs and outputs, separated by commas
- Use `name=value` to give names to inputs/outputs
- Inputs themselves can be lists
- Use placeholders like `{input.name}` to refer to single named inputs
- Where there are multiple inputs, `{input}` will insert them all, separated by spaces
- Use `log:` to list log outputs, which will not be removed when jobs fail
- Errors are an expected part of developing Snakemake workflows, and usually give enough information to track down what is causing them
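A sketch of named inputs, multiple outputs, and a log (the `aligner` command and file names are hypothetical):

```
rule align:
    input:
        reads="data/{sample}.fastq",
        ref="reference/genome.fa",
    output:
        bam="aligned/{sample}.bam",
        stats="aligned/{sample}.stats.txt",
    log:
        "logs/align_{sample}.log"
    # {input.reads} and {input.ref} pick out single named inputs;
    # plain {input} would insert both, separated by a space.
    shell:
        "aligner --ref {input.ref} --stats {output.stats}"
        " {input.reads} > {output.bam} 2> {log}"
```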
How Snakemake plans jobs
- A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
- Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
- If output files already exist, Snakemake can skip parts of the DAG
- Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating
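You can also ask Snakemake to show its plan without running anything (the target name is illustrative; the second command assumes Graphviz's `dot` is installed):

```
# Dry run: list the jobs that would run, and why
snakemake --dry-run --cores 1 results/plot.png

# Render the DAG of jobs for a target
snakemake --dag results/plot.png | dot -Tpng > dag.png
```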
Optimising workflow performance
- To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
- You also need to consider RAM, disk, and network bottlenecks
- Profile your jobs to see what is taking most resources
- Use `--cores all` to enable using all CPU cores
- Snakemake is great for running workflows on compute clusters
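A sketch of declaring threads in a rule (`pigz` here stands in for any multi-threaded tool):

```
rule compress:
    input: "data/{sample}.txt"
    output: "data/{sample}.txt.gz"
    # Snakemake caps this at the number of cores given via --cores
    threads: 4
    shell:
        "pigz -p {threads} -c {input} > {output}"
```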
Awkward corners
- Use Python input functions that take a dict of wildcards, and return a list of strings, to handle complex dependency issues that can’t be expressed in pure Snakemake.
- Import `glob.glob` to match multiple files on disk that fit a specific pattern. Don’t rely on this to find intermediate files in final production workflows, since it won’t find files that are not present at the start of the workflow run.
- Use `snakemake --touch` if you need to mark files as up-to-date, so that Snakemake won’t try to regenerate them.
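A sketch of an input function that uses `glob.glob` (the directory layout is made up; remember the caveat above about files that do not exist at the start of the run):

```
from glob import glob

# Input functions receive the wildcards and return a list of filenames
def batch_inputs(wildcards):
    return sorted(glob(f"raw/{wildcards.batch}/*.txt"))

rule merge_batch:
    input: batch_inputs
    output: "merged/{batch}.txt"
    shell:
        "cat {input} > {output}"
```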
Tidying up
- Use `.smk` files in `workflow/rules` to compartmentalise the Snakefile, and use `include:` lines in the main Snakefile to link them into the workflow.
- Add a rule at the top of the Snakefile with `default_target: True` to specify the default output of a workflow.
- Use `.gitignore` to avoid committing input or output data or the Snakemake cache.
- Use `.git_keep` files to preserve empty directories.
- Use Git submodules to link to libraries you have written that aren’t on PyPI.
- Include a `README.md` in your repository explaining how to run the workflow.
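A sketch of a tidied-up main Snakefile (file names are illustrative):

```
# workflow/Snakefile
configfile: "config.yaml"

rule all:
    default_target: True
    input:
        "results/final_report.html"

# Pull the actual rules in from separate .smk files
include: "rules/preprocess.smk"
include: "rules/report.smk"
```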
Publishing your workflow
- Test the workflow by making a fresh clone and following the README instructions
- Use `zip -r -9 --exclude "**/.git/*" --exclude "**/.git" filename.zip dirname` to prepare a ZIP file of a freshly-cloned repository, with submodules.
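For the fresh-clone test, something along these lines (the repository URL is a placeholder):

```
# Clone the repository, including any submodules, into a new directory
git clone --recurse-submodules https://example.org/you/workflow.git test-clone
cd test-clone

# Dry run first, then run the workflow as described in the README
snakemake --dry-run
snakemake --cores all --use-conda
```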