Snakemake workflows in Reana

casschmitt · 17 August 2022 12:15

I am new to Reana and trying to run a simple snakemake workflow. Am I correct in assuming that Reana does not support the script: directive of snakemake? Therefore I am now using shell: directives to run a python script. Then the direct passing of arguments as in snakemake is not available (e.g. snakemake.input etc.). How exactly do you pass arguments to the script? The example workflow uses a function in C, I am trying the same here in python:

rule Task3:
    input:
        "intermediateResult_10.txt",
        "intermediateResult_5.txt"
    output:
        "finalResult.txt"
    shell:
        "cat {input} > {output}"

rule Task2:
    input:
        seed = "seed.txt",
        script = config["generate_randoms"]
    output:
        "intermediateResult_{NumberOfRandoms}.txt"
    shell:
        " python '{input.script}(\"{input.seed}\",\"{output}\")'"

rule Task1:
    output:
        "seed.txt"
    shell:
        "echo 42 > {output}"

which calls the script:

#input_file = snakemake.input[0]
#output_file = snakemake.output[0]
#number_of_randoms = snakemake.params.current_NumberOfRandoms

import numpy

#import sys
#input_file = str(sys.argv[1])
#number_of_randoms = int(sys.argv[2])
#output_file = str(sys.argv[3])

def function(input_file,output_file):
        number_of_randoms = 5
        with open(input_file,"r") as input:
            numpy.random.seed(int(input.readlines()[0]))
        with open(output_file,"w") as output:
            for i in range(number_of_randoms): output.write(f"{numpy.random.random()}\n")

I am very unsure about the syntax here and cannot find any documentation. For completeness here is my reana_snakemake.yaml:

version: 0.8.0
inputs:
  files:
    - snakemake/python_script.py
  directories:
    - snakemake
  parameters:
    input: snakemake/inputs.yaml
workflow:
  type: snakemake
  file: snakemake/Snakefile
outputs:
  files:
    - finalResult.txt

and snakemake/inputs.yaml:

generate_randoms: snakemake/python_script.py

This workflow runs without problems locally. reana-client validate runs fine too, but reana-client run returns
==> ERROR: Cannot create workflow workflow:
Object of type function is not JSON serializable

I would be very greatful for any tips, suggestions or working examples!

tiborsimko · 19 August 2022 10:22

Hello @casschmitt!

Thanks for the clear description of your observations. Let me try reply on the various topics that you raised in separate paragraphs below.

Support for script vs run vs shell Snakemake directives in REANA

You are right that REANA does not support the script: directive or the run: directive, only the shell: one. One advantage is that this allows to better separate workflow orchestration tasks (run in a Snakemake Python process) from research runtime job tasks (run as independent processes in independent environments). I hope this may lead to more reusable analysis parts in various reuse scenarios that may come in the future. For example, upgrading to a very future Snakemake version, or even to another workflow manager, or reusing parts with colleagues not using Snakemake.)

Is it acceptable for you to use the shell: directive? If you think it may be very advantageous or otherwise important to support also script: and run: directives, we can revive our developer musings going in this direction. (The problem is not easy with the current REANA <-> Snakemake bridge architecture, but may have other side solutions.)

Troubles starting the workflow example

While I have been trying to reproduce the error in your example, I came across an internal bug in REANA related to validating and starting workflows that have “unnecessarily” duplicated input files, so to speak. I have created a bug demo in our tracker to fix this. The good news is that there is a simple workaround.

Perhaps the errors you were seeing may have been similar? Please try the following easy workaround: it is not necessary to specify both inputs.files and inputs.directories in reana.yaml when they point to the same items and when all the necessary input files are already covered by the inputs.directories directive already. You may be able to use only the following part to transfer everything:

inputs:
  directories:
    - snakemake
  parameters:
    input: snakemake/inputs.yaml

Will such a change help in your test scenario? If not, please let me know which REANA client version and which REANA cluster version are you using, for example from the output of the reana-client ping command.

How to use Snakemake parameters with the run directive

Concerning the question of passing parameters to the script, you can consult our RooFit example analysis that has as an input the number of events to generate (by default 20k) and that comes also with a Snakemake example. Basically, one declares the following inputs.yaml:

events: 20000
fitdata: code/fitdata.C
gendata: code/gendata.C

and in Snakefile one can use params directive:

rule gendata:
    input:
        gendata_tool=config["gendata"]
    output:
        "results/data.root"
    params:
        events=config["events"]
    container:
        "docker://reanahub/reana-env-root6:6.18.04"
    resources:
        kubernetes_memory_limit="256Mi"
    shell:
        "mkdir -p results && root -b -q '{input.gendata_tool}({params.events},\"{output}\")'"

Note the differences between input and params in the rule. Perhaps you can enrich your Snakefile in a similar way?

Specifying job container environments

Note also in the above Snakefile snippet that we usually recommend to specify a concrete container image version to use for each concrete Snakemake rule, which is missing in your example. (This is linked to the discussion of script vs run vs shell directives above; we like to document the exact computing environment where the job is supposed to run, independently of how the job is “orchestrated” by the workflow manager.)

If you don’t specify any concrete container image, REANA will use snakemake/snakemake:v6.8.0 image by default. This should be OK for your simply example, because the Snakemake 6.8.0 image should contain numpy prerequisite already. So no problems there. Mentioning this just a a side note that in general, specifying required job image is good for “encapsulating” all job’s necessary runtime environment dependencies, so that even when the Snakemake workflow orchestration version would change, you can have the exact numpy version that the original environment used. (This is why we use reana-env-root6:6.18.04 container image in the Roofit’s example Snakefile above.)

Note also that if you do use Docker and develop container job environments locally on your machine, you may want to use extended reana-client validate --environments option, which would check for this and other potential issues related to using dedicated containerised environments for various Snakemake rules

casschmitt · 21 September 2022 17:32

Thank you tiborsimko for your detailed answer. Please forgive the late answer due to holidays and other obligations.

I now managed to run Python scripts in snakemake on Reana by employing shell commands, albeit at t he cost of additional parsing code. I do think that for complex HEP workflows this may somewhat refute a useful advantage of workflow management, supplied within other frameworks.

You are right, that container environments should always be supplied for analysis preservation. Do you know whether the containerized: directive from snakemake is supported? It supplies local container .sif files to tasks or whole workflows.

Thank you again!