Running commands with Snakemake


  • Before running Snakemake you need to write a Snakefile
  • A Snakefile is a text file which defines a list of rules
  • Rules have inputs, outputs, and shell commands to be run
  • You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule
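
For example, a minimal rule might look like this (the filenames and command are illustrative, not taken from the lesson):

    rule count_lines:
        output: "counts/demo.txt"
        input:  "data/demo.fastq"
        shell:
            "wc -l {input} > {output}"

Asking Snakemake for the output file then runs the shell command from the matching rule:

    snakemake -j1 -p counts/demo.txt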

Running Python code with Snakemake


  • Snakemake will manage Conda environments for you, to help ensure that workflows always use a consistent set of packages
  • Use the --use-conda option to snakemake to enable this behaviour
  • Use conda: to specify a Conda environment definition (.yml) file; the path is relative to the file in which it is specified.
  • Conda environment files are conventionally put in the workflow/envs directory
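
A sketch of how these pieces fit together, assuming a hypothetical plotting step and package list:

    # workflow/envs/plotting.yml  (assumed contents)
    channels:
      - conda-forge
    dependencies:
      - python=3.11
      - matplotlib

    # In workflow/Snakefile, so the conda: path resolves to workflow/envs/plotting.yml
    rule plot_results:
        output: "plots/result.png"
        input:  "results/summary.tsv"
        conda:  "envs/plotting.yml"
        shell:
            "python scripts/plot.py {input} {output}"

Running snakemake --use-conda -j1 plots/result.png will then create the environment the first time and reuse it on later runs.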

Placeholders and wildcards


  • Snakemake rules are made generic with placeholders and wildcards
  • Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
  • Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
  • Snakemake checks for various error conditions and will stop if it sees a problem
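
For instance, a single generic rule can make a count for any sample (the file layout is invented for illustration):

    rule count_reads:
        output: "counts/{sample}.txt"
        input:  "reads/{sample}.fq"
        shell:
            "echo $(( $(wc -l < {input}) / 4 )) > {output}"

Requesting counts/ref1.txt makes Snakemake set the {sample} wildcard to ref1, and the {input} and {output} placeholders in the shell command are filled in from that wildcard value.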

Chaining rules


  • Snakemake links up rules by iteratively looking for rules that make missing inputs
  • Careful choice of filenames allows this to work
  • Rules may have multiple named input files (and output files)
  • Use expand() to generate lists of filenames from a template
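
A sketch of two chained rules plus a target rule using expand(); the sample names are made up:

    rule trim:
        output: "trimmed/{sample}.fq"
        input:  "reads/{sample}.fq"
        shell:  "head -n 4000 {input} > {output}"    # stand-in for a real trimming tool

    rule count_reads:
        output: "counts/{sample}.txt"
        input:  "trimmed/{sample}.fq"                # matches the output of the rule above
        shell:  "echo $(( $(wc -l < {input}) / 4 )) > {output}"

    rule all_counts:
        input:
            expand("counts/{sample}.txt", sample=["ref1", "ref2", "ref3"])

Asking for all_counts makes Snakemake work backwards: each counts file needs a trimmed file, which in turn needs a raw reads file.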

Metadata and parameters


  • Use a YAML file to define parameters to the workflow, and attach it using configfile: near the top of the file.
  • Override individual options at run-time with the --config option.
  • Load additional parameter files at run-time using the --configfile option.
  • Use a CSV file, loaded into a Pandas dataframe, to hold ensemble-specific metadata.
  • Use lookup() to get information out of the dataframe in a rule.
  • Use params: to define job-specific parameters that do not describe filenames.
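
A rough sketch of how these pieces combine; the YAML keys, CSV columns, and exact lookup() arguments are illustrative assumptions rather than the lesson's own code:

    # config.yaml might contain:
    #   min_reads: 1000
    configfile: "config.yaml"

    import pandas as pd
    samples = pd.read_csv("samples.csv")    # assumed columns: sample, condition

    rule summarise:
        output: "summary/{sample}.txt"
        input:  "counts/{sample}.txt"
        params:
            min_reads = config["min_reads"],
            # lookup() is built into recent Snakemake; these arguments are an assumed usage
            condition = lookup(query="sample == '{sample}'", within=samples, cols="condition"),
        shell:
            "echo '{params.condition} (min {params.min_reads}):' $(cat {input}) > {output}"

Individual settings can then be overridden at run time with, for example, snakemake --config min_reads=2000, or supplemented with snakemake --configfile extra_config.yaml.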

Multiple inputs and outputs


  • Rules can have multiple inputs and outputs, separated by commas
  • Use name=value to give names to inputs/outputs
  • Inputs themselves can be lists
  • Use placeholders like {input.name} to refer to single named inputs
  • Where there are multiple inputs, {input} will insert them all, separated by spaces
  • Use log: to list log outputs, which will not be removed when jobs fail
  • Errors are an expected part of developing Snakemake workflows, and usually give enough information to track down what is causing them
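
For example, a rule with several named inputs and outputs plus a log file might be sketched like this (the aligner command is a placeholder):

    rule align:
        output:
            bam   = "aligned/{sample}.bam",
            stats = "aligned/{sample}.stats"
        input:
            reads = ["trimmed/{sample}_1.fq", "trimmed/{sample}_2.fq"],   # an input can itself be a list
            ref   = "genome/reference.fa"
        log: "logs/align_{sample}.log"
        shell:
            "some_aligner -r {input.ref} {input.reads}"
            " -o {output.bam} --stats {output.stats} 2> {log}"

Here {input.reads} expands to the two FASTQ files separated by a space, while a bare {input} would insert all three input files.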

How Snakemake plans jobs


  • A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
  • Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
  • If output files already exist, Snakemake can skip parts of the DAG
  • Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating
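
You can inspect the plan without running anything; the target filename here is illustrative:

    # Dry run: show the jobs Snakemake plans to run, without executing them
    snakemake -n -p -j1 counts/ref1.txt

    # Visualise the DAG of jobs (needs the GraphViz 'dot' tool)
    snakemake --dag -j1 counts/ref1.txt | dot -Tpng > dag.png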

Optimising workflow performance


  • To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
  • You also need to consider RAM, disk, and network bottlenecks
  • Profile your jobs to see what is taking most resources
  • Use --cores all to enable using all CPU cores
  • Snakemake is great for running workflows on compute clusters
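
A sketch of a multi-threaded rule; pigz is used here simply as an example of a tool that accepts a thread count:

    rule compress:
        output: "compressed/{sample}.fq.gz"
        input:  "trimmed/{sample}.fq"
        threads: 4        # Snakemake never gives a job more threads than the cores it was started with
        shell:
            "pigz -p {threads} -c {input} > {output}"

Run with snakemake --cores all so that Snakemake can schedule jobs across every available core.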

Awkward corners


  • Use Python input functions that take a dict of wildcards, and return a list of strings, to handle complex dependency issues that can’t be expressed in pure Snakemake.
  • Use glob.glob() (from the glob module) to find files on disk that match a specific pattern. Don’t rely on this to find intermediate files in final production workflows, since it only sees files that already exist when the workflow starts.
  • Use snakemake --touch if you need to mark files as up-to-date, so that Snakemake won’t try to regenerate them.
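
A minimal sketch of an input function and of glob.glob(); the file layout and selection logic are invented for illustration:

    from glob import glob

    def choose_reads(wildcards):
        # Return a list of input filenames based on the wildcard values
        if wildcards.sample in config.get("pretrimmed_samples", []):
            return ["pretrimmed/" + wildcards.sample + ".fq"]
        return ["trimmed/" + wildcards.sample + ".fq"]

    rule count_reads:
        output: "counts/{sample}.txt"
        input:  choose_reads          # pass the function itself, not a call
        shell:  "echo $(( $(wc -l < {input}) / 4 )) > {output}"

    rule concat_all:
        output: "all_reads.fq"
        input:  glob("reads/*.fq")    # only sees files that exist when the Snakefile is parsed
        shell:  "cat {input} > {output}"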

Tidying up


  • Use .smk files in workflow/rules to compartmentalise the Snakefile, and use include: lines in the main Snakefile to link them into the workflow.
  • Add a rule at the top of the Snakefile with default_target: True to specify the default output of a workflow.
  • Use .gitignore to avoid committing input or output data or the Snakemake cache.
  • Use .gitkeep files to preserve empty directories.
  • Use Git submodules to link to libraries you have written that aren’t on PyPI.
  • Include a README.md in your repository explaining how to run the workflow.
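
A sketch of a tidied layout; the rule-file names and config keys are made up:

    # workflow/Snakefile
    configfile: "config.yaml"

    # The default target: built when snakemake is run with no explicit target
    rule all:
        default_target: True
        input:
            expand("counts/{sample}.txt", sample=config["samples"])

    # Pull in rules split out into workflow/rules/*.smk
    include: "rules/trimming.smk"
    include: "rules/counting.smk"

together with a .gitignore that keeps data and the Snakemake cache out of the repository:

    .snakemake/
    reads/
    counts/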

Publishing your workflow


  • Test the workflow by making a fresh clone and following the README instructions
  • Use zip -r -9 --exclude "**/.git/*" --exclude "**/.git" filename.zip dirname to prepare a ZIP file of a freshly-cloned repository, including any submodules.
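
A sketch of checking a fresh clone before publishing; the repository URL is a placeholder:

    # Clone the repository (and any submodules) into a clean directory
    git clone --recurse-submodules https://github.com/yourname/yourworkflow.git
    cd yourworkflow

    # Dry run first, then follow the README for a full run
    snakemake -n -j1 --use-conda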