A simple job from the "first example" has been running for 9+ hours

ysmirnov · 30 March 2021 11:49

Hi,

following my previous post on this forum, I keep observing some strange behavior of my jobs. Some have been queued for four days already; some fail with no logs available; some finish successfully (?) with no logs available; and there is this roofit job I submitted yesterday, which has been running for 9 hours already.

This roofit job is the one from the “first example” section of the REANA documentation. Simply by looking at its code I would expect it to finish in 10 minutes, tops. What is it doing for 9 hours?

Most importantly, can it be due to some global mistake in my setup or something similar? Why are there no logs available for almost any of my jobs? Why are they all queued for long periods of time?

Thanks!

tiborsimko · 30 March 2021 16:35

Hi @ysmirnov, unfortunately your example was submitted at a very busy time, when the REANA production cluster was overloaded with numerous heavy workflows. This means that, firstly, your roofit example was queued for a long time before execution started and, secondly, even after it started, the individual jobs of the workflow had to wait for cluster resources to free up before they could run.

During this time, the web interface showed the workflow as “running”, but for most of that time it was actually in a “pending” state, waiting for cluster node resources to free up. We shall amend the workflow status reporting to say “pending” rather than “running” in these kinds of situations.

Regarding the availability of workflow job logs, currently the logs become available only after a given job finishes, both on the command line and in the web interface. The logs are not streamed “live” whilst the job runs. This is something that we plan to improve in May, though.
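Until then, one way to cope is to poll the workflow status and fetch the logs only once the workflow has reached a final state. The following is a minimal sketch rather than an official recipe: `wait_for_final` is a hypothetical helper name, and it assumes that the output of the status command contains one of the words `finished`, `failed` or `stopped` once the workflow is done.

```shell
# Hypothetical helper: poll a status command until its output mentions a
# final state. Usage: wait_for_final INTERVAL_SECONDS COMMAND [ARGS...]
wait_for_final() {
    interval=$1
    shift
    while true; do
        # Run the supplied status command and capture whatever it prints.
        output=$("$@" 2>/dev/null)
        # Assumption: a final state shows up as a plain word in the output.
        case "$output" in
            *finished*) echo "finished"; return 0 ;;
            *failed*)   echo "failed";   return 1 ;;
            *stopped*)  echo "stopped";  return 1 ;;
        esac
        sleep "$interval"
    done
}
```

For the roofit example this could be used as `wait_for_final 60 reana-client status -w roofit; reana-client logs -w roofit`, which checks once a minute and prints the logs whether the workflow finished or failed.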

So there was no mistake on your end at all. The REANA production cluster was overloaded, and the REANA web interface was optimistically showing a “running” status instead of the more accurate “pending” status for your workflow jobs. This is something we shall work to improve in April-May.

I see that your roofit workflow demo finished successfully in the meantime…

ysmirnov · 30 March 2021 16:38

Hi @tiborsimko,

I see, thanks for your reply. Yes, my job finished successfully, so now I’m going to submit the same job on a much smaller number of events. Hopefully, it’ll finish much faster…

tiborsimko · 30 March 2021 16:47

I see there are quite a few pending jobs on the production cluster still…

If you would like to run a quick example, or to develop/debug some workflow on a tiny data sample, i.e. something that usually takes on the order of minutes to run, then you may want to check out our QA cluster. It has only two nodes for running user jobs, but the cluster is usually free, so your workflows would not be “competing” against workflows of other heavy users, so to speak.

If you would like to give the QA cluster a try, the address to use is https://reana-qa.cern.ch but please note that the web interface is behind the CERN firewall, so you will have to use something like SSH tunnelling through LXPLUS to access it (e.g. with sshuttle).

You will have to ask for a new access token there, since access tokens are particular to each REANA cluster instance.

Otherwise the setup is the same as in production, since the QA and PRODUCTION clusters are currently running the same version of REANA:

$ source /afs/cern.ch/user/r/reana/public/reana-qa/bin/activate
$ export REANA_SERVER_URL=https://reana-qa.cern.ch
$ export REANA_ACCESS_TOKEN=xxxxxxxxxxxx
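
Since a token from the production cluster will not work on QA, it may help to verify that both variables are actually set in the current shell before running anything. A minimal sketch; `reana_env_ok` is a hypothetical helper and not part of reana-client:

```shell
# Hypothetical helper: succeed only if both REANA variables are non-empty.
reana_env_ok() {
    if [ -z "${REANA_SERVER_URL:-}" ]; then
        echo "REANA_SERVER_URL is not set" >&2
        return 1
    fi
    if [ -z "${REANA_ACCESS_TOKEN:-}" ]; then
        echo "REANA_ACCESS_TOKEN is not set" >&2
        return 1
    fi
    # Print the target server so it is obvious which cluster will be used.
    echo "Using REANA server $REANA_SERVER_URL"
}
```

Running `reana_env_ok && reana-client ping` afterwards should then confirm whether the QA server is reachable with the new token.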