Running commands with Snakemake
- Before running Snakemake you need to write a Snakefile
- A Snakefile is a text file which defines a list of rules
- Rules have inputs, outputs, and shell commands to be run
- You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule
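The points above can be pictured with a small plain-Python sketch. This is not Snakemake itself or how it is implemented; the Rule class, the rule names, the file names, and the shell commands are all made up for illustration. It only shows the idea of asking for a file and getting back the command of the rule that makes it.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A toy stand-in for a Snakefile rule: one input, one output, one command."""
    name: str
    input: str
    output: str
    shell: str


# Hypothetical rules; file names and commands are illustrative only.
RULES = [
    Rule("count_lines", "data.txt", "data.lines", "wc -l data.txt > data.lines"),
    Rule("sort_data", "data.txt", "data.sorted", "sort data.txt > data.sorted"),
]


def command_for(target: str) -> str:
    """Return the shell command of the rule whose output is the requested file."""
    for rule in RULES:
        if rule.output == target:
            return rule.shell
    raise LookupError(f"No rule produces {target!r}")


print(command_for("data.lines"))  # prints: wc -l data.txt > data.lines
```

Asking for a file that no rule produces raises an error in this sketch, which loosely mirrors Snakemake refusing a target it does not know how to make.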
Running Python code with Snakemake
Placeholders and wildcards
- Snakemake rules are made generic with placeholders and wildcards
- Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
- Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
- Snakemake checks for various error conditions and will stop if it sees a problem
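As a rough illustration of wildcard matching and placeholder filling, here is a plain-Python sketch. It is an approximation under stated assumptions, not Snakemake's own code: the helper names match_wildcards and plan are hypothetical, the rule is a simple dictionary, and each wildcard is assumed to appear only once in the output pattern.

```python
import re


def match_wildcards(output_pattern: str, target: str):
    """Return a dict of wildcard values if target fits the pattern, else None."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", output_pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text must match exactly
        else:
            regex += f"(?P<{part}>.+)"    # a wildcard matches one or more characters
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


# Hypothetical rule: file names and command are illustrative only.
RULE = {
    "input": "{sample}.fq",
    "output": "{sample}.count",
    "shell": "wc -l {input} > {output}",
}


def plan(rule: dict, target: str) -> str:
    """Fill in the shell command for a target, or stop if the rule cannot make it."""
    wildcards = match_wildcards(rule["output"], target)
    if wildcards is None:
        raise ValueError(f"Rule output {rule['output']!r} cannot produce {target!r}")
    infile = rule["input"].format(**wildcards)
    return rule["shell"].format(input=infile, output=target)


print(plan(RULE, "ref1.count"))  # prints: wc -l ref1.fq > ref1.count
```

The wildcard value is worked out from the requested output first, and only then are the input name and the command filled in, which is the order the points above describe.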
Chaining rules
- Snakemake links up rules by iteratively looking for rules that make missing inputs
- Careful choice of filenames allows this to work: the output pattern of one rule must match the input of the next
- Rules may have multiple named input files (and output files)
- Use expand() to generate lists of filenames from a template
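To show what generating a list of filenames from a template means, here is a plain-Python imitation. The function expand_sketch is a hypothetical stand-in written for this example, not Snakemake's expand(), and it only covers the simple case of combining every value of every placeholder.

```python
from itertools import product


def expand_sketch(template: str, **values) -> list:
    """Fill the template with every combination of the given placeholder values."""
    names = list(values)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]


# Hypothetical sample names and read numbers.
files = expand_sketch("{sample}_{read}.fq", sample=["ref1", "ref2"], read=[1, 2])
print(files)  # prints: ['ref1_1.fq', 'ref1_2.fq', 'ref2_1.fq', 'ref2_2.fq']
```

A list like this is typically what a rule with many inputs needs, for example a summary step that reads one file per sample.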
Metadata and parameters
Multiple inputs and outputs
How Snakemake plans jobs
- A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
- Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
- If output files already exist, Snakemake can skip parts of the DAG
- Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating
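The planning steps above can be sketched in plain Python. This is a simplified model, not Snakemake's scheduler: it assumes Python 3.9 or later for graphlib, uses made-up file names and made-up modification times in a dictionary instead of real files, and looks only at timestamps, ignoring the log of previous runs.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each output file maps to the input files it is built from.
JOBS = {
    "ref1.trimmed.fq": ["ref1.fq"],
    "ref1.count": ["ref1.trimmed.fq"],
    "summary.txt": ["ref1.count"],
}

# Hypothetical modification times; a missing key means the file does not exist.
MTIMES = {"ref1.fq": 100, "ref1.trimmed.fq": 150, "ref1.count": 120}


def jobs_to_run(target: str, jobs: dict, mtimes: dict) -> list:
    """List the outputs that must be (re)built to bring target up to date."""
    # Work back from the requested output to collect the relevant part of the DAG.
    needed = {}
    stack = [target]
    while stack:
        out = stack.pop()
        if out in jobs and out not in needed:
            needed[out] = jobs[out]
            stack.extend(jobs[out])

    to_run = []
    # static_order() yields each file after all the files it depends on.
    for node in TopologicalSorter(needed).static_order():
        if node not in needed:
            continue  # a source file with no rule, so there is nothing to run
        inputs = needed[node]
        stale = (
            node not in mtimes                                        # output missing
            or any(i in to_run for i in inputs)                       # an input will be rebuilt
            or any(mtimes.get(i, 0) > mtimes[node] for i in inputs)   # an input is newer
        )
        if stale:
            to_run.append(node)
    return to_run


print(jobs_to_run("summary.txt", JOBS, MTIMES))  # prints: ['ref1.count', 'summary.txt']
```

In this example ref1.trimmed.fq is newer than its input, so that part of the DAG is skipped, while ref1.count is older than its input and summary.txt is missing, so both are scheduled.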
Optimising workflow performance
- To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
- You also need to consider RAM, disk, and network bottlenecks
- Profile your jobs to see what is taking most resources
- Snakemake is great for running workflows on compute clusters
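The advice about matching threads to cores comes down to simple arithmetic, sketched here in plain Python. The function concurrent_jobs is a hypothetical helper for this example; it assumes every job asks for the same number of threads and that a job asking for more threads than there are cores is scaled down to the available cores. It deliberately ignores RAM, disk, and network limits.

```python
def concurrent_jobs(cores: int, threads_per_job: int) -> tuple:
    """Return (threads each job actually gets, jobs that fit at once, idle cores)."""
    if cores < 1 or threads_per_job < 1:
        raise ValueError("cores and threads_per_job must both be at least 1")
    effective = min(threads_per_job, cores)  # a job cannot use more than all cores
    at_once = cores // effective             # whole jobs that fit side by side
    idle = cores - at_once * effective       # cores left with nothing to do
    return effective, at_once, idle


print(concurrent_jobs(8, 4))  # prints: (4, 2, 0)
print(concurrent_jobs(8, 3))  # prints: (3, 2, 2)
print(concurrent_jobs(4, 8))  # prints: (4, 1, 0)
```

With 8 cores, 4 threads per job keeps every core busy, whereas 3 threads per job leaves 2 cores idle, which is the kind of mismatch worth looking for when profiling.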