A simple job from the "first example" is running for 9+ hours

tiborsimko · 30 March 2021 16:35

Hi @ysmirnov, unfortunately your example was submitted at a very busy time, when the REANA production cluster was overloaded with numerous heavy workflows. This means that, firstly, your roofit example was queued for a long time before execution started, and, secondly, even after it started, the individual jobs of the workflow had to wait for cluster resources to free up before they could run.

During this time, the web interface showed the workflow in a “running” state, but for most of that time the workflow was actually in a “pending” state, waiting for cluster node resources to free up. We shall amend the workflow status reporting to say “pending” rather than “running” in these kinds of situations.

Regarding the availability of workflow job logs, the logs currently become available only after a given job finishes, both on the command line and in the web interface. They are not streamed “live” whilst the job runs. This is something that we plan to improve in May.

So there was no mistake on your end at all. The REANA production cluster resources were overloaded, and the REANA web interface was optimistically showing a “running” status instead of the more accurate “pending” status for your workflow jobs. This is something we shall work to improve in April-May.

I see that your roofit demo workflow has finished successfully in the meantime…