Running commands with Snakemake


  • Before running Snakemake you need to write a Snakefile
  • A Snakefile is a text file which defines a list of rules
  • Rules have inputs, outputs, and shell commands to be run
  • You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule
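
As a rough illustration of the points above, the sketch below holds a one-rule Snakefile in a Python string and writes it to disk. The rule name `count_lines`, the file names `data.txt` and `data.count`, and the helper `write_snakefile` are all hypothetical, chosen only for this example.

```python
from pathlib import Path

# A minimal Snakefile with a single rule. The rule and file names are
# hypothetical; substitute your own.
SNAKEFILE_TEXT = '''\
rule count_lines:
    input: "data.txt"
    output: "data.count"
    shell: "wc -l data.txt > data.count"
'''


def write_snakefile(directory):
    """Write the example Snakefile into `directory` and return its path."""
    path = Path(directory) / "Snakefile"
    path.write_text(SNAKEFILE_TEXT)
    return path
```

With the Snakefile and a `data.txt` in the current directory, running `snakemake --cores 1 data.count` should ask Snakemake to make `data.count`, which it does by running the `wc` command from the rule.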

Running Python code with Snakemake


Placeholders and wildcards


  • Snakemake rules are made generic with placeholders and wildcards
  • Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
  • Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
  • Snakemake checks for various error conditions and will stop if it sees a problem
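
The matching and substitution described above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not Snakemake's implementation: the rule is a hypothetical dict, each wildcard is assumed to match one or more arbitrary characters, and each wildcard name is assumed to appear only once in the output pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern such as '{sample}.count' into a regex with named groups."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text
        else:
            regex += f"(?P<{part}>.+)"      # wildcard: one or more characters
    return re.compile(regex)


def match_rule(rule, target):
    """Return (wildcards, input_file, command) if the rule can make `target`."""
    m = pattern_to_regex(rule["output"]).fullmatch(target)
    if m is None:
        return None                         # this rule cannot make the target
    wildcards = m.groupdict()
    input_file = rule["input"].format(**wildcards)
    command = rule["shell"].format(input=input_file, output=target)
    return wildcards, input_file, command


# Hypothetical rule, written as a dict for illustration only.
rule = {
    "output": "{sample}.count",
    "input": "reads/{sample}.fq",
    "shell": "wc -l {input} > {output}",
}

print(match_rule(rule, "ref1.count"))
# ({'sample': 'ref1'}, 'reads/ref1.fq', 'wc -l reads/ref1.fq > ref1.count')
```

A target that does not fit the output pattern, such as `ref1.txt`, gives `None` here; Snakemake itself would report that no rule can produce the file.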

Chaining rules


  • Snakemake links up rules by iteratively looking for rules that make missing inputs
  • Careful choice of filenames allows this to work
  • Rules may have multiple named input files (and output files)
  • Use expand() to generate lists of filenames from a template
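
To show what generating filenames from a template amounts to, here is a stand-in written in plain Python. `expand_sketch` is a hypothetical name, not Snakemake's `expand()`; it assumes every keyword value is a list and fills the template with every combination of the values.

```python
from itertools import product


def expand_sketch(template, **values):
    """Fill `template` with every combination of the given value lists."""
    names = list(values)
    combos = product(*(values[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


filenames = expand_sketch(
    "reads/{sample}_{read}.fq",
    sample=["ref1", "ref2"],   # hypothetical sample names
    read=[1, 2],
)
print(filenames)
# ['reads/ref1_1.fq', 'reads/ref1_2.fq', 'reads/ref2_1.fq', 'reads/ref2_2.fq']
```

A list built this way can serve as the input of a rule that gathers many files, or as the set of final targets for the whole workflow.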

Metadata and parameters


Multiple inputs and outputs


How Snakemake plans jobs


  • A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
  • Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
  • If output files already exist, Snakemake can skip parts of the DAG
  • Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating
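
The planning logic can be sketched as a walk back from the requested output through a small dependency graph. This is a simplified model under stated assumptions: the `JOBS` table and file names are hypothetical, only file timestamps are considered (not the log of previous runs), and each job has a single output.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical two-step workflow: each output maps to the inputs it needs.
JOBS = {
    "trimmed.fq": ["raw.fq"],
    "trimmed.count": ["trimmed.fq"],
}


def plan(target, workdir, jobs=JOBS):
    """Return the outputs that need (re)building, in execution order."""
    to_run = []

    def visit(path):
        """Return True if `path` will be rebuilt in this run."""
        if path not in jobs:
            # A source file: no job makes it, so it must already exist.
            if not (workdir / path).exists():
                raise FileNotFoundError(path)
            return False
        # Visit every input first (a list, so none is skipped).
        input_rebuilt = [visit(dep) for dep in jobs[path]]
        out = workdir / path
        stale = (
            not out.exists()
            or any(input_rebuilt)
            or any((workdir / dep).stat().st_mtime > out.stat().st_mtime
                   for dep in jobs[path])
        )
        if stale and path not in to_run:
            to_run.append(path)
        return stale

    visit(target)
    return to_run


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    (work / "raw.fq").write_text("@read1\n")
    first_plan = plan("trimmed.count", work)       # no outputs exist yet

    for name in first_plan:                        # pretend the jobs ran
        (work / name).write_text("done\n")
    os.utime(work / "raw.fq", (100, 100))          # explicit, ordered mtimes
    os.utime(work / "trimmed.fq", (200, 200))
    os.utime(work / "trimmed.count", (300, 300))
    second_plan = plan("trimmed.count", work)      # everything is up to date

    os.utime(work / "raw.fq", (400, 400))          # the source file changes
    third_plan = plan("trimmed.count", work)

print(first_plan)   # ['trimmed.fq', 'trimmed.count']
print(second_plan)  # []
print(third_plan)   # ['trimmed.fq', 'trimmed.count']
```

The second plan is empty because every output is newer than its inputs, which is the sense in which existing files let parts of the DAG be skipped.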

Optimising workflow performance


  • To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
  • You also need to consider RAM, disk, and network bottlenecks
  • Profile your jobs to see what is taking most resources
  • Snakemake is great for running workflows on compute clusters
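
The arithmetic behind matching threads to cores can be sketched as follows, assuming every job requests the same number of threads and that CPU cores are the only constraint. The function name is hypothetical; the cap on threads reflects Snakemake scaling a job's threads down to the cores it has been given.

```python
def concurrent_jobs(cores, threads_per_job):
    """Return (threads each job actually gets, jobs that fit at once)."""
    # A job cannot use more threads than the cores made available.
    effective_threads = min(threads_per_job, cores)
    return effective_threads, cores // effective_threads


print(concurrent_jobs(8, 4))   # (4, 2): two 4-thread jobs fill 8 cores
print(concurrent_jobs(4, 8))   # (4, 1): the job is scaled down to 4 threads
print(concurrent_jobs(8, 3))   # (3, 2): two jobs run, leaving 2 cores idle
```

The last case shows why a thread count that does not divide the core count evenly can leave cores unused.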

Awkward corners


  • Use Python input functions, which take the wildcards object and return a list of filenames, to handle complex dependencies that can’t be expressed in pure Snakemake.
  • Use glob_wildcards to find files on disk that match a specific pattern. Don’t rely on this to find intermediate files in final production workflows, since it only sees files that are present at the start of the workflow run.
  • Use snakemake --touch if you need to mark files as up-to-date, so that Snakemake won’t try to regenerate them.
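
An input function is ordinary Python, so it can be tried outside Snakemake. The sketch below assumes the wildcards arrive as an object with one attribute per wildcard. The sample table, the function name `reads_for_sample`, and the `SimpleNamespace` stand-in used to call it are all hypothetical.

```python
from types import SimpleNamespace

# Hypothetical sample table: some samples are paired-end, some single-end,
# which is hard to express with a fixed filename pattern.
SAMPLE_READS = {
    "ref1": ["reads/ref1_1.fq", "reads/ref1_2.fq"],
    "etoh60": ["reads/etoh60_1.fq"],
}


def reads_for_sample(wildcards):
    """Return the list of input files for the sample named by the wildcards."""
    return SAMPLE_READS[wildcards.sample]


# Stand-in for the object Snakemake would pass to the function.
fake_wildcards = SimpleNamespace(sample="ref1")
print(reads_for_sample(fake_wildcards))
# ['reads/ref1_1.fq', 'reads/ref1_2.fq']
```

In a Snakefile the function is passed by name, without calling it (for example `input: reads_for_sample`), and Snakemake calls it once the wildcard values for a job are known.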