Failed to try resolving symlinks

msaimper · 25 March 2021 18:37

Dear experts,

I have a heavy ntuple production that I split into many steps with different options (systematics group 1, systematics group 2, etc.). Originally this was so that I could parallelize on a cluster, and I would like to stick to this structure for my workflow if possible.

When I launch a workflow with only a few of the steps (~5) it succeeds, but when I submit a workflow with a large number of steps (~30), some of them fail at random with this error message:

job: :
 failed to try resolving symlinks in path "/var/log/pods/default_reana-run-job-6b8ea706-2bef-43bf-8f28-1389c3722fe9-8kz56_827cb480-d5fc-4211-8dfc-791c2c5db2bc/job/0.log": lstat /var/log/pods/default_reana-run-job-6b8ea706-2bef-43bf-8f28-1389c3722fe9-8kz56_827cb480-d5fc-4211-8dfc-791c2c5db2bc/job/0.log: no such file or directory

Error
krb5: :
 unable to retrieve container logs for docker://97336489dddc332ed3825481fa85d6625faa35520fcdcb5805698a833da79651

Completed

Is there a limit on the number of steps? I am not sure I understand this issue. In case the experts can access my workflows, here are examples:

REANA works
REANA fails

Thank you very much for your help!

Best wishes
Matthias

tiborsimko · 25 March 2021 20:45

Hi Matthias,

We have been upgrading the REANA cluster this evening, and I noticed that you have run some big workflows at the same time… So this was most probably the cause of failures. Sorry for the inconvenience.

The above problems (“unable to retrieve container logs”) were observed on some nodes that had trouble after previous out-of-memory situations. We have deployed a fix to the REANA cluster and progressively rebooted all the nodes to solve the issue. The cluster was back in shape at around 19:15.

This means that your workflows should run better from now on. Please retry.

However, we have about twenty nodes available for user jobs, so if you launch too many parallel tasks (“heavy ntuple production”), the jobs may be stuck in the pending state for quite a while before a node frees up.

How many parallel jobs do you need to run at any given time? How long would such a job typically run? If it is really “heavy”, then we may need to enlarge the cluster…

Best regards,

Tibor

msaimper · 26 March 2021 08:23

Hi Tibor,

thanks a lot for your reply.

To give you a bit of context: I am currently implementing the RECAST workflow of my ATLAS analysis. I use GitLab CI to test it on limited stat and with no systematics, and this is good enough. However, I thought I should also validate the full workflow at least once, including all systematics and decent signal stat.

For this full validation I need to produce signal ntuples with all systematics included. In my standard workflow I distribute jobs on a batch system, so naturally this translates into 171 “skimming” steps in RECAST. Each of these is rather short [O(1 hr) if I run on decent stat].

After that I need to run the fit, which typically happens in 3 rather expensive steps (building all the histograms + building the WS + running the limit scan).

I hope this gives you a better idea of my task.

Cheers
Matthias

tiborsimko · 30 March 2021 19:09

Thanks for the detailed description. If your jobs are within 10 GB of memory and if the local ephemeral disk storage needs are within say 30 GB size, then this type of workflow should be well runnable on REANA. We might need to increase the number of nodes in the cluster though, since if some other workflow with all systematics is running at the same time (as is often the case these days!), then it could take quite a while to schedule all the jobs…
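To get a rough feel for the scheduling delay, here is a back-of-the-envelope sketch. It assumes that each node runs one job at a time and that nothing else is using the cluster; neither is stated above, and both function names are made up for illustration.

```python
import math


def estimate_waves(n_jobs: int, n_slots: int) -> int:
    """Number of scheduling rounds needed if at most n_slots jobs run at once."""
    if n_jobs < 0 or n_slots <= 0:
        raise ValueError("n_jobs must be >= 0 and n_slots must be > 0")
    return math.ceil(n_jobs / n_slots)


def estimate_wall_hours(n_jobs: int, n_slots: int, hours_per_job: float) -> float:
    """Rough wall-clock time, assuming every job takes about hours_per_job."""
    return estimate_waves(n_jobs, n_slots) * hours_per_job


# Numbers from this thread: 171 skimming steps of O(1 hr) on about twenty nodes.
# ASSUMPTION: one job per node at a time, cluster otherwise idle.
print(estimate_waves(171, 20))
print(estimate_wall_hours(171, 20, 1.0))
```

Under those assumptions the 171 skimming steps would need about nine rounds. If another large workflow competes for the same nodes, the real waiting time would be longer.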

msaimper · 31 March 2021 08:41

Hi Tibor,
thanks for the reply, good to read that I can run this on REANA!

I tried again and it failed for reasons I do not understand:

https://reana.cern.ch/details/c48d12af-20d2-4119-9884-199c339b8aa9

Could you have a look?

tiborsimko · 31 March 2021 10:23

I see in the logs:

$ reana-client logs -w ana-susy-2019-12_skimming_all_all_all_fitting_theory_weight_tree_scanning.1
...
  File "/usr/local/lib/python3.8/site-packages/yadage/stages.py", line 52, in apply
    self.rule.apply(WorkflowView(adageobj, self.offset))
  File "/usr/local/lib/python3.8/site-packages/yadage/stages.py", line 101, in apply
    self.schedule()
  File "/usr/local/lib/python3.8/site-packages/yadage/stages.py", line 144, in schedule
    scheduler(self, self.stagespec)
  File "/usr/local/lib/python3.8/site-packages/yadage/handlers/scheduler_handlers.py", line 198, in singlestep_stage
    parameters = {
  File "/usr/local/lib/python3.8/site-packages/yadage/handlers/scheduler_handlers.py", line 199, in <dictcomp>
    k: select_parameter(stage.view, v)
  File "/usr/local/lib/python3.8/site-packages/yadage/handlers/scheduler_handlers.py", line 49, in select_parameter
    value = handler(wflowview, parameter)
  File "/usr/local/lib/python3.8/site-packages/yadage/handlers/expression_handlers.py", line 158, in stage_output_selector
    assert len(steps) == 1
AssertionError

This could indicate some troubles with the workflow definition?
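The failing assertion suggests that a parameter selecting the output of one stage matched a number of steps other than one. Below is a simplified illustration of that kind of check. It is not yadage's actual code; `select_single_step` and the step dictionaries are hypothetical.

```python
def select_single_step(steps_by_stage: dict, stage_name: str) -> dict:
    """Return the only step of a stage, mimicking an 'exactly one step' check."""
    steps = steps_by_stage.get(stage_name, [])
    if len(steps) != 1:
        raise ValueError(
            f"stage {stage_name!r} has {len(steps)} steps, expected exactly 1"
        )
    return steps[0]


# Hypothetical workflow state: one stage fanned out into several steps.
steps_by_stage = {
    "skimming": [{"output": "skim_0.root"}, {"output": "skim_1.root"}],
    "fitting": [{"output": "fit.root"}],
}

# Works: the stage has exactly one step.
print(select_single_step(steps_by_stage, "fitting")["output"])

# Fails: the stage has two steps, so a single-step selection is ambiguous.
try:
    select_single_step(steps_by_stage, "skimming")
except ValueError as err:
    print(err)
```

If a stage in your workflow expands into several steps, a downstream parameter that refers to it as if it were a single step might trip a check of this kind.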

Here are some tips:

(1) To ease debugging, it would be good to include your reana.yaml among the workflow input files, so that it is uploaded to the workspace together with the inputs. This is not necessary for a successful workflow run; it just makes debugging easier, since reana.yaml would then be present in the workflow’s workspace. For example, see the files: ... part here:

inputs:
  parameters:
    did: 404958
    xsec_in_pb: 0.00122
    dxaod_file: https://recastwww.web.cern.ch/recastwww/data/reana-recast-demo/mc15_13TeV.123456.cap_recast_demo_signal_one.root
  directories:
    - workflow
  files:
    - reana.yaml
workflow:
  type: yadage
  file: workflow/workflow.yml
outputs:
  files:
    - statanalysis/fitresults/limit.png

(2) Have you tried running reana-client validate to check for any warnings regarding workflow parameters and output selectors?

(3) FWIW I’m seeing in specs/steps.yml commands like:

cd /ttDM_DESY/run
xrdfs eoshome.cern.ch ls {input_dir} | grep '.root' > eos_paths.txt
mkdir -p inputs/$(basename {input_dir})
while read line; do xrdcp root://eoshome.cern.ch/$line inputs/$(basename {input_dir})/.; echo -e $(basename {input_dir}) >> signal.txt; done < eos_paths.txt

This may be troublesome in case of big input files, since the input files seem to be copied into a directory under /ttDM_DESY. This directory lives inside the container, i.e. it does not use workflow’s workspace, but the container’s ephemeral storage (which is volatile and of the order of tens of GBs). If you copy big files (?) then the container could run out of ephemeral storage of the node and it could get killed.

How big are your input files? Could this be the cause?
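One quick way to see where you stand is to check the free space of the target directory before copying. This is only a sketch: `has_room` and its 10% safety margin are hypothetical, and the number reported is for the underlying filesystem, which may not reflect any per-container storage limit.

```python
import shutil


def has_room(path: str, needed_bytes: int, safety_margin: float = 0.1) -> bool:
    """True if the filesystem holding `path` can take needed_bytes plus a margin."""
    free = shutil.disk_usage(path).free
    return needed_bytes * (1.0 + safety_margin) <= free


GB = 1024 ** 3

# "." is the current directory; point this at the directory you copy into.
usage = shutil.disk_usage(".")
print(f"free: {usage.free / GB:.1f} GB of {usage.total / GB:.1f} GB")
print("room for a 30 GB copy:", has_room(".", 30 * GB))
```

Running it once from /ttDM_DESY/run and once from the workspace directory would show how much room each location has.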

Generally speaking, it seems more advantageous to treat the Docker container image as a R/O provider of the necessary environment and software, to which no data gets written, and to use the automatically provided R/W workflow workspace for all data operations. In other words, if you don’t write into hard-coded paths under / such as /ttDM_DESY, but rather write into the default workspace directory created by REANA and automatically mounted into your container at execution time, you’ll have more workspace for your data and reduce the risk of ephemeral storage exhaustion.

This is similar to running Docker containers locally: you probably wouldn’t write any big files inside the containers directly, but would rather use an external data volume mounted into the Docker container via -v at execution time.

Is it possible to rewrite your code to something as follows:

source /home/atlas/release_setup.sh
mkdir inputs
xrdcp root://example.org/bigfile.root inputs
/ttDM_DESY/run.sh ./inputs

so that you always stay in the workspace?
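If part of the job is driven from Python, the same idea can be enforced with a small guard that resolves every data path relative to the workspace and rejects anything outside it. This is a sketch under the assumption that the job starts in the workspace directory; `workspace_path` is a hypothetical helper, not part of REANA.

```python
from pathlib import Path
from typing import Optional


def workspace_path(relative: str, workspace: Optional[Path] = None) -> Path:
    """Resolve `relative` under the workspace and refuse paths that escape it."""
    root = (workspace or Path.cwd()).resolve()
    target = (root / relative).resolve()
    try:
        # relative_to raises ValueError when target is not below root.
        target.relative_to(root)
    except ValueError:
        raise ValueError(f"{relative!r} points outside the workspace {root}") from None
    return target


# ASSUMPTION: the job starts in the workspace directory, so cwd is the workspace.
inputs_dir = workspace_path("inputs")
print(inputs_dir)
```

With this guard, a hard-coded absolute path such as /ttDM_DESY/run, or a relative path that climbs out with .., raises an error instead of silently writing into the container's ephemeral storage.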

If the input files being written into /ttDM_DESY are guaranteed to be small, of the order of 5 GB or thereabouts, then this is probably not the cause. It would only become problematic if they could reach 30 GB or thereabouts.