Strange dirac_ui experience (fixed by updating dirac_ui)
Hi all,

Recently I was working with a small-VO user at Edinburgh who was trying to use parametric jobs in their workflow and had hit a strange problem. Their jobs were submitted to DIRAC and split correctly, but the actual DIRAC jobs refused to run. The DIRAC scheduler marked the jobs as failed an hour later because they failed to proceed past the 'submitting' state in the JobManager.

After debugging various combinations, we think we've narrowed this down to the user's dirac_ui instance being out of date. Confusingly, they were still able to submit and manage individual jobs correctly. If anyone else hits this problem, updating their dirac_ui instance fixed it.

With this in mind, would it be possible to configure the GridPP DIRAC to not accept jobs from an out-of-date dirac_ui instance? I know that each new dirac_ui release is announced, but not all users seem to act on this. If this can be done, it might help avoid strange problems like this in the future.

Best Regards,
Rob
Hi Rob,

On Tue, Mar 03, 2020 at 03:24:37PM +0000, Robert Currie wrote:
The DIRAC scheduler marked the jobs as failed 1hr later as they failed to proceed past the 'submitting' state in the JobManager. After debugging various combinations we think we've narrowed this down to the user's dirac_ui instance being out of date. But, confusingly, they were still able to submit and manage individual jobs correctly.
If anyone else hits this problem, updating their dirac_ui instance fixed the problem.
This was a change in the bulk submission behaviour to make it more transactional: newer versions of DIRAC require an "I've finished submitting jobs" call to confirm the transaction; otherwise the server assumes there was a problem and cancels the jobs (allowing the client to try again safely without creating duplicates). As you discovered, clients that predate this change unfortunately don't work properly, as they never confirm the jobs.
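To make the behaviour above concrete, here is a minimal toy model of a confirm-or-cancel bulk submission. It is only a sketch and not the DIRAC API: the class `ToyJobManager`, its methods, the "Waiting" state and the one-hour timeout (chosen to mirror the delay Rob reported) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BulkTransaction:
    job_ids: list
    created: float          # time the bulk submission arrived (seconds)
    confirmed: bool = False


class ToyJobManager:
    """Hypothetical in-memory server; illustrates the idea only."""

    def __init__(self, confirm_timeout=3600.0):
        self.confirm_timeout = confirm_timeout  # assumed value, not from DIRAC
        self.status = {}        # job id -> state name
        self.transactions = {}  # transaction id -> BulkTransaction
        self._next_job_id = 1

    def submit_bulk(self, parameters, now):
        """Create one job per parameter, parked in 'Submitting'."""
        job_ids = []
        for _ in parameters:
            job_id = self._next_job_id
            self._next_job_id += 1
            self.status[job_id] = "Submitting"
            job_ids.append(job_id)
        txn_id = len(self.transactions) + 1
        self.transactions[txn_id] = BulkTransaction(job_ids, now)
        return txn_id

    def confirm_bulk(self, txn_id):
        """The "I've finished submitting jobs" call: release the jobs."""
        txn = self.transactions[txn_id]
        txn.confirmed = True
        for job_id in txn.job_ids:
            if self.status[job_id] == "Submitting":
                self.status[job_id] = "Waiting"

    def expire_unconfirmed(self, now):
        """Fail jobs whose transaction was never confirmed in time."""
        for txn in self.transactions.values():
            if not txn.confirmed and now - txn.created >= self.confirm_timeout:
                for job_id in txn.job_ids:
                    self.status[job_id] = "Failed"


server = ToyJobManager()
old_client_txn = server.submit_bulk(["a", "b"], now=0.0)  # never confirms
new_client_txn = server.submit_bulk(["c", "d"], now=0.0)
server.confirm_bulk(new_client_txn)
server.expire_unconfirmed(now=3600.0)
print(server.status)
```

In this model the old client's jobs stay in "Submitting" until the timeout and are then failed, while the confirming client's jobs move on, which matches the symptom described above.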
With this in mind, would it be possible to configure the GridPP DIRAC to not accept jobs from an out-of-date dirac_ui instance?
Unfortunately this isn't a feature that's available at the moment. We could try to add something in the dirac-proxy-init handshake or similar, but then we have the problem of knowing which releases are compatible or not. In general the compatibility is good as the developers do try to keep the interface as consistent as possible.
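For illustration only, the mechanical part of such a handshake check could look like the sketch below. Everything in it is hypothetical: the dotted version format, the `MIN_CLIENT_VERSION` value and the function names are assumptions, and it does not address the harder problem mentioned above of knowing which releases are actually compatible.

```python
# Hypothetical minimum client version the server would accept.
MIN_CLIENT_VERSION = (7, 0, 0)


def parse_version(text):
    """Parse an assumed 'major.minor.patch' string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def client_is_accepted(client_version, minimum=MIN_CLIENT_VERSION):
    """Return True if the client version is at least the minimum."""
    return parse_version(client_version) >= minimum
```

A server using this would still need someone to maintain the minimum version by hand whenever an incompatible change is released.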
If this can be done it might help avoid strange problems like this in the future,
We generally recommend that people re-install the UI to the latest version as one of the first steps if they see strange behaviour. If that doesn't work, please ask on this list: there is normally someone here who has seen the same thing before; otherwise, we're always happy to investigate.

Regards,
Simon
participants (2)
- Robert Currie
- Simon Fayer