Instructor Notes

Running commands with Snakemake

Environment activation

The most likely issue learners will encounter here is needing to activate their Snakemake environment when they have opened a fresh terminal. This is hopefully as simple as

conda activate snakemake

If Conda isn’t set up to automatically activate itself on starting a shell session, they may also need to run something like

source ~/miniconda3/bin/activate

where the exact path to run will depend on their specific setup.

Directories

Learners less experienced with the shell may want to cd into directories to edit files; if they do this and forget to cd back out again, they will encounter difficulties as Snakemake may not be able to find the Snakefile or the input files.

If they try to work around this, they may end up with multiple Snakefiles or ones with inputs pointing at incorrect relative paths.

Technically, you can specify absolute paths in Snakefiles, but this is not recommended, because it makes the workflow less portable. For example, when using Snakemake to execute some rules on another machine, the run would fail because Snakemake cannot gather the dependencies into the correct location. Similarly, if someone else were to run the workflow on their own machine, their home directory is unlikely to be the same, so the workflow would fail.

New researchers frequently like to hardcode absolute paths to their data, so this is an important point to reinforce.

Read-only data

The example data for this lesson uses read-only raw data throughout, including the containing directories. If learners end up with multiple copies of the data and need to delete one, they should use the commands:

chmod -R u+w raw_data
rm -r raw_data

Having the containing directories read-only means that extra output files can’t be added by accident. It’s a relatively strict measure: while assembling data, one would want only the files to be read-only, so that more files could still be added as they became ready.

Use of the --forceall flag

In the first few episodes we always run Snakemake with the --forceall flag, and it’s not explained what this does until Ep. 04. The rationale is that the default Snakemake behaviour when pruning the DAG leads to learners seeing different output (typically the message “nothing to be done”) when repeating the exact same command. This can seem strange to learners who are used to scripting and imperative programming.

The internal rules used by Snakemake to determine which jobs in the DAG are to be run, and which skipped, are pretty complex, but the behaviour seen under --forceall is much simpler and more consistent: Snakemake simply runs every job in the DAG every time. You can think of --forceall as disabling the lazy evaluation feature of Snakemake, until we are ready to properly introduce and understand it.
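For instructors who want a mental model to hand, the contrast can be sketched in a few lines of Python. This is a toy model under loud assumptions, not Snakemake's actual algorithm: the plan function, the job dictionaries, and the file names are all hypothetical, and the model looks only at modification times, ignoring the further criteria that make the real rules complex.

```python
def plan(jobs, mtimes, forceall=False):
    """Return the names of the jobs to run (toy model, NOT Snakemake's logic).

    jobs: list of dicts with "name", "inputs" and "output", in dependency order.
    mtimes: maps each existing file name to a modification time.
    """
    to_run = []
    updated = set()  # outputs that will be regenerated during this run
    for job in jobs:
        out = job["output"]
        stale = (
            out not in mtimes  # output is missing
            or any(i in updated for i in job["inputs"])  # upstream job reruns
            or any(mtimes.get(i, 0) > mtimes[out] for i in job["inputs"])  # input is newer
        )
        if forceall or stale:
            to_run.append(job["name"])
            updated.add(out)
    return to_run


# Hypothetical two-step workflow: raw.txt -> counts.txt -> plot.png
jobs = [
    {"name": "count", "inputs": ["raw.txt"], "output": "counts.txt"},
    {"name": "plot", "inputs": ["counts.txt"], "output": "plot.png"},
]
up_to_date = {"raw.txt": 1, "counts.txt": 2, "plot.png": 3}

lazy_plan = plan(jobs, up_to_date)                   # nothing to be done
forced_plan = plan(jobs, up_to_date, forceall=True)  # every job, every time
print(lazy_plan, forced_plan)
```

In this model, repeating the same command on an up-to-date directory gives an empty plan, which corresponds to the “nothing to be done” message, while the forced plan is the same on every run.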

Running Python code with Snakemake

Placeholders and wildcards

Instructor Note

If the learner copies down a previous command here, then they might include the --use-conda flag. In that case, Snakemake will build the Conda environments, even though it will not need to use them.

Chaining rules

Metadata and parameters

Instructor Note

CSVs aren’t the only way to do this; for more complex data, YAML or even JSON may be a better choice. But CSV is good for most purposes, and easier to get started with. It’s also more readable for non-specialists investigating the workflow, which is valuable in and of itself.
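To show learners how little code CSV metadata needs, a sketch like the following may help. It uses only the Python standard library; the column names, the run names, the values, and the load_metadata helper are all invented for illustration and are not taken from the lesson's data.

```python
import csv
import io

# Hypothetical metadata: column names and values are invented for illustration
CSV_TEXT = """name,beta,mass
run_a,2.0,0.05
run_b,2.2,0.10
"""


def load_metadata(handle):
    """Index the rows by their "name" column for lookup by wildcard value."""
    return {row["name"]: row for row in csv.DictReader(handle)}


# io.StringIO stands in for an open file so that the sketch is self-contained
metadata = load_metadata(io.StringIO(CSV_TEXT))
print(metadata["run_b"]["beta"])  # prints 2.2 (csv yields strings, not numbers)
```

Note that the csv module returns every value as a string, so numeric parameters need converting before use in calculations.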

Multiple inputs and outputs

Definitely run through the spectrum plot

This plot is referred to from subsequent lessons, so you definitely need to go through it.

How Snakemake plans jobs

Version differences

Older versions of Snakemake only support outputting the DAG in dot format, so that argument is not needed there.

Optimising workflow performance

Running on cluster and cloud

Running workflows on HPC or Cloud systems could be a whole course in itself. The topic is too important not to be mentioned here, but it is also complex to teach because you need a cluster to work on.

If you are teaching this lesson and have access to institutional HPC, then ideally you should liaise with the administrators of the system to set up a suitable installation of a recent Snakemake version and a profile to run jobs on the cluster job scheduler. In practice this may be easier said than done!

If you are able to demonstrate Snakemake running on cloud as part of a workshop then we’d much appreciate any feedback on how you did this and how it went.

Awkward corners

Closure alternatives

Here we choose to use a closure (a function returned by another function, where the former’s behaviour depends on the arguments to the latter), so that the same code can be used for both the start and the end of the plateau. There are other ways to phrase this: you could define a free function get_plateau(wildcards, position) and then, in the rule definition, either use functools.partial(get_plateau, position=...) to set the position, or wrap the call in a lambda such as lambda wildcards: get_plateau(wildcards, ...). We choose the closure here because defining functions should already be familiar to most learners, and passing functions as values needs to be learned anyway (since we have to pass one to Snakemake), whereas lambdas and functools may not be familiar, and aren’t needed elsewhere in the lesson.
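If a learner asks how the three phrasings compare, a side-by-side sketch may be useful. It runs outside Snakemake: the PLATEAUS lookup table, its numbers, the wildcard name run, and the plateau_param helper are all hypothetical stand-ins, and types.SimpleNamespace imitates the wildcards object that Snakemake would pass in.

```python
from functools import partial
from types import SimpleNamespace

# Hypothetical lookup table standing in for the lesson's metadata;
# the keys and numbers are made up for illustration only.
PLATEAUS = {"run1": {"start": 10, "end": 25}}


def get_plateau(wildcards, position):
    """Free function: look up one end of the plateau for a given run."""
    return PLATEAUS[wildcards.run][position]


def plateau_param(position):
    """Closure: return a function of wildcards with position baked in."""
    def get_plateau_for_position(wildcards):
        return get_plateau(wildcards, position)
    return get_plateau_for_position


# Three equivalent callables, each taking only the wildcards
via_closure = plateau_param("start")
via_partial = partial(get_plateau, position="start")
via_lambda = lambda wildcards: get_plateau(wildcards, "start")

# Stand-in for the wildcards object that Snakemake would supply
wildcards = SimpleNamespace(run="run1")
results = [f(wildcards) for f in (via_closure, via_partial, via_lambda)]
print(results)  # prints [10, 10, 10]
```

All three produce a callable of a single wildcards argument, which is the shape the rule definition needs; they differ only in which Python features the learner has to know.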

Change of command line

If learners skipped over the previous section, note that we’ve changed the standard snakemake call we were previously using: now, we don’t use --forceall, so we only regenerate when necessary (which makes the run quicker). --cores all, meanwhile, tells Snakemake to use all available CPU cores. In this case it doesn’t make a difference, since only one job is needed by this run, but it’s a useful default invocation for production use.
