Awkward corners
Overview
Questions
- How can I look up metadata based on wildcard values?
- How can I select different numbers of input files depending on wildcard values?
- How can I tell Snakemake not to regenerate a file?
Objectives
- Understand how the fact that Snakemake is built on Python allows us to work around some of its shortcomings in certain use cases.
- Understand how to handle trickier metadata and input file lookups.
- Be able to avoid Snakemake re-running a rule when this is not wanted.
Beyond the pseudoscalar: Input functions
Recall the rule that we have been working on to fit the correlation function of the pseudoscalar meson:
# Compute pseudoscalar mass and amplitude, read plateau from metadata,
# and plot effective mass
rule ps_mass:
    input: "raw_data/beta{beta}/out_corr"
    output:
        data="intermediary_data/beta{beta}/corr.ps_mass.json.gz",
        plot=multiext(
            "intermediary_data/beta{beta}/corr.ps_eff_mass",
            config["plot_filetype"],
        ),
    log:
        messages="intermediary_data/beta{beta}/corr.ps_mass.log",
    params:
        plateau_start=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_start"),
        plateau_end=lookup(within=metadata, query="beta == {beta}", cols="ps_plateau_end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {config[plot_styles]} 2>&1 | tee {log.messages}"Now, we are frequently interested in more symmetry channels than just
the pseudoscalar. In principle, we could make a copy of this rule, and
change ps to, for example, v or
av. (We did this in an earlier challenge for the vector
channel.) However, just like adding more ensembles, this rapidly makes
our Snakefiles unwieldy and difficult to maintain. How can we adjust
this rule so that it works for any channel?
The first step is to replace ps with a wildcard. The
output should then use
corr.{channel}_mass.json rather than
corr.ps_mass.json, and similarly for the log. We will also
need to pass an argument --channel {wildcards.channel} in
the shell block, so that the code knows what channel it is
working with. Since the rule is no longer only for the ps
channel, it should be renamed, for example to
meson_mass.
What about the plateau start and end positions? Currently these are
found using a call to lookup, with the column hardcoded to
"ps_plateau_start" or "ps_plateau_start". We
can’t substitute a wildcard in here, so instead we need to make use of
an input function.
When Snakemake is given a Python function as a param or an
input, it will call this
function, passing in the parsed wildcards, and use the result of the
function call as the value. For example, I could define a function:
PYTHON
def get_mass(wildcards):
    return 1.0
and then make use of this within a rule as:
rule test_mass:
    params:
        mass=get_mass,
    ...

This would then set the parameter params.mass to
1.0.
How can we make use of this for getting the plateau metadata?
PYTHON
def plateau_param(position):
    """
    Return a function that can be used to get a plateau position from the metadata.
    `position` should be `start` or `end`.
    """
    def get_plateau(wildcards):
        return lookup(
            within=metadata,
            query="beta == {beta}",
            cols=f"{wildcards['channel']}_plateau_{position}",
        )
    return get_plateau

Working inside out, you will recognise the lookup call
from before, but the cols argument is now changed: we make
use of Python’s f-strings
to insert both the channel and the position (start or
end). We wrap this into a function called
get_plateau, which only takes the wildcards as
an argument. The position is defined in the outer function,
which creates and returns the correct version of
get_plateau depending on which position is
specified. We need to do this because Snakemake doesn’t give a direct
way to specify additional arguments to the input function.
To use this function within the rule, we can use
plateau_start=plateau_param("start") in the
params block for the plateau start position, and similarly
for the end.
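As an aside, the same effect can be achieved without nesting function definitions by using functools.partial to bind the position argument ahead of time. This is a minimal sketch, assuming Snakemake accepts any callable taking the wildcards (as it does a plain function):
PYTHON
from functools import partial

def get_plateau(wildcards, position):
    # The same lookup as above, with position now an explicit argument
    return lookup(
        within=metadata,
        query="beta == {beta}",
        cols=f"{wildcards['channel']}_plateau_{position}",
    )
The params block would then read plateau_start=partial(get_plateau, position="start"), and similarly for the end.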
With these changes, the full rule now becomes:
# Compute meson mass and amplitude, read plateau from metadata,
# and plot effective mass
rule meson_mass:
    input: "raw_data/beta{beta}/out_corr"
    output:
        data="intermediary_data/beta{beta}/corr.{channel}_mass.json.gz",
        plot=multiext(
            "intermediary_data/beta{beta}/corr.{channel}_eff_mass",
            config["plot_filetype"],
        ),
    log:
        messages="intermediary_data/beta{beta}/corr.{channel}_mass.log",
    params:
        plateau_start=plateau_param("start"),
        plateau_end=plateau_param("end"),
    conda: "envs/analysis.yml"
    shell:
        "python -m su2pg_analysis.meson_mass {input} --channel {wildcards.channel} --output_file {output.data} --plateau_start {params.plateau_start} --plateau_end {params.plateau_end} --plot_file {output.plot} --plot_styles {config[plot_styles]} 2>&1 | tee {log.messages}"We can test this for the vector mass:
snakemake --use-conda --cores all --printshellcmds intermediary_data/beta2.0/corr.v_mass.json.gz

Restricted spectrum
Define a rule restricted_spectrum that generates a plot
of the pseudoscalar decay constant against its mass (both in lattice
units), plotting only data where \(\beta\) is below a specified value
beta0, which is included in the filename.
Use this to test plotting the spectrum for \(\beta < 2.0\).
Hint: Similarly to the above, you may want to define a function that
returns a function, where the former takes the relevant fixed part of
the filename (corr.ps_mass or
pg.corr.ps_decay_const) as input, and the latter takes the
wildcards.
We can constrain the data plotted by constraining the input data.
As in the example above, we can use an input function for
this, but this time for the input data rather than to
define params.
def spectrum_param(slug):
    """
    Return an input function listing the data files for all ensembles
    with beta below the beta0 wildcard.
    `slug` is the fixed part of the filename, e.g. `corr.ps_mass`.
    """
    def spectrum_inputs(wildcards):
        # Build one filename per metadata row with beta below the cutoff
        return [
            f"intermediary_data/beta{beta}/{slug}.json.gz"
            for beta in metadata[metadata["beta"] < float(wildcards["beta0"])]["beta"]
        ]
    return spectrum_inputs
rule restricted_spectrum:
    input:
        script="src/plot_spectrum.py",
        ps_mass=spectrum_param("corr.ps_mass"),
        ps_decay_const=spectrum_param("pg.corr.ps_decay_const"),
    output:
        plot=multiext("assets/plots/spectrum_beta{beta0}", config["plot_filetype"]),
    conda: "envs/analysis.yml"
    shell:
        "python {input.script} {input.ps_mass} {input.ps_decay_const} --y_observable f_ps --zero_y_axis --zero_x_axis --output_file {output.plot} --plot_styles {config[plot_styles]}"To generate the requested plot would then be:
snakemake --use-conda --cores all --printshellcmds assets/plots/spectrum_beta2.0.pdf

Globbing
So far, our workflow has been entirely deterministic in what inputs
to use. We have referred to parts of filenames that are substituted in
based on the requested outputs as “wildcards”. You might be familiar
with another meaning of the word “wildcard”, however: the *
and ? characters that you can use in the shell to match
any number of characters and exactly one character,
respectively.
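For example, using the raw data layout from earlier in this lesson:
ls raw_data/beta*/out_corr
would list the out_corr file for every value of beta present on disk, since * matches any run of characters; by contrast, ? matches exactly one character, so beta?.? would match beta2.0 but not a hypothetical beta10.0.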
In general we would prefer to avoid using these in our workflows, since we would like them to be reproducible. If a particular input file is missing, we would like to be told about it, rather than the workflow silently producing different results. (This also makes the failure cases easier to debug, since otherwise we can hit cases where we call our tools with an unexpected number of inputs, and they fail with strange errors.)
However, occasionally we do find ourselves needing to perform
shell-style wildcard matches, also known as globbing. For
example, let’s say that we would like to be able to plot only the data
that we have already computed, without regenerating anything. We can
perform a wildcard glob using Python’s glob
function.
At the top of the file, add:
from glob import glob

Then we can add a rule:
rule quick_spectrum:
    input:
        script="src/plot_spectrum.py",
        ps_mass=glob(
            "intermediary_data/beta*/corr.ps_mass.json.gz",
        ),
        ps_decay_const=glob(
            "intermediary_data/beta*/pg.corr.ps_decay_const.json.gz",
        ),
    output:
        plot=multiext("intermediary_data/check_spectrum", config["plot_filetype"]),
    conda: "envs/analysis.yml"
    shell:
        "python {input.script} {input.ps_mass} {input.ps_decay_const} --y_observable f_ps --zero_y_axis --zero_x_axis --output_file {output.plot} --plot_styles {config[plot_styles]}"Let’s test this now
snakemake --use-conda --cores all --printshellcmds intermediary_data/check_spectrum.pdf

If you have recently purged the workflow output, then this might raise an error, as there are no data to plot. Otherwise, opening the resulting file in your PDF viewer, you'll most likely see a plot with fewer points than in previous plots. If you're lucky, then all the points will be present.
Use with care in production!
This rule will only include in your plots data that were already present at the start of the workflow run. This means that if you start from a clean state, as anyone looking to reproduce your work will, then the rule will fail; and even if you start from a partial run, some points will (non-deterministically) be omitted from your plots.
For final plots based on intermediary data, always specify the
input explicitly.
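For instance, here is a sketch of an explicit equivalent of the input above, using Snakemake's expand function together with the beta column of the metadata table (the same table used by the lookup calls earlier):
PYTHON
ps_mass=expand(
    "intermediary_data/beta{beta}/corr.ps_mass.json.gz",
    beta=metadata["beta"],
),
With this, Snakemake will generate any missing files before plotting, rather than silently omitting them.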
Marking files as up to date
One of the useful aspects of Snakemake’s DAG is that it will automatically work out which files need updating based on changes to their inputs, and re-run any rules needed to update them. Occasionally, however, this isn’t what we want to do.
Consider the following example:
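A minimal sketch of such a Snakefile is below; the rule names and analysis commands are purely illustrative, but the file names match those used in the commands later in this section:
rule generate_data:
    input: "input_file"
    output: "data_file"
    # Slow step: takes around 24 hours
    shell: "python -m slow_analysis {input} --output_file {output}"

rule plot_data:
    input: "data_file"
    output: "plot_file"
    # Fast step: takes around 15 seconds
    shell: "python -m plot_results {input} --output_file {output}"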
This workflow has two steps: taking the input data and generating some intermediary data from them, which requires 24 hours; and taking the resulting data and plotting results from them, which takes 15 seconds. If we make a trivial change to the input file, such as reformatting white space, we'd like to be able to test the output stage without needing to wait 24 hours for the data to be regenerated.
Further, if we share our workflow with others, we’d like them to be able to reproduce the easy final stages without being required to run the more expensive early ones. This especially applies where the input data are very large, and the intermediary data smaller; this reduces the bandwidth and disk space required for those reproducing the work.
We can achieve this by telling Snakemake to “touch” the relevant
file. Similar to the touch shell command, which updates the
modification time of a file without making changes, Snakemake’s
--touch option updates Snakemake’s records of when a file
was last updated, without running any rules to update it.
For example, in the above workflow, we may run
snakemake --touch data_file
snakemake --cores all --use-conda --printshellcmds plot_file

This will run only the second rule, not the first. This will be the
same regardless of whether input_file is even present, or
when it was last updated.
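If you'd like to check in advance which rules Snakemake plans to run, a dry run (snakemake --dry-run plot_file, or -n for short) lists the jobs that would be executed without running any of them.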
Key Points

- Use Python input functions, which take a dict of wildcards and return a list of strings, to handle complex dependency issues that can't be expressed in pure Snakemake.
- Import glob.glob to match multiple files on disk matching a specific pattern. Don't rely on this finding intermediate files in final production workflows, since it won't find files not present at the start of the workflow run.
- Use snakemake --touch if you need to mark files as up to date, so that Snakemake won't try to regenerate them.