Hi Rob,

On Tue, Mar 03, 2020 at 03:24:37PM +0000, Robert Currie wrote:
The DIRAC scheduler marked the jobs as failed 1hr later as they failed to proceed past the 'submitting' state in the JobManager. After debugging various combinations we think we've narrowed this down to the user's dirac_ui instance being out of date. But, confusingly, they were still able to submit and manage individual jobs correctly.
If anyone else hits this problem, updating the dirac_ui instance fixed it.
This was a change in the bulk submission behaviour to make it more transactional: newer versions of DIRAC require an "I've finished submitting jobs" call to confirm the transaction, otherwise the server assumes there was a problem and cancels the jobs (allowing the client to try again safely without creating duplicates). As you discovered, clients that predate this change unfortunately don't work properly, as they never confirm the jobs.
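As a rough illustration of the confirm step, here is a toy model in Python. It is only a sketch and not DIRAC's actual code or API: the class, the method names and the "Received" state are made up for the example, and the one-hour timeout is simply taken from the behaviour you saw.

```python
class ToyJobManager:
    """Toy stand-in for a server that holds bulk submissions until confirmed.

    Hypothetical names throughout; this does not mirror the real JobManager.
    """

    def __init__(self, confirm_timeout=3600.0):
        # Seconds a job may stay unconfirmed before it is given up on.
        self.confirm_timeout = confirm_timeout
        self.jobs = {}  # job_id -> {"state": str, "submitted_at": float}
        self._next_id = 1

    def submit_bulk(self, n_jobs, now):
        """Register n_jobs in the 'Submitting' state and return their IDs."""
        ids = []
        for _ in range(n_jobs):
            job_id = self._next_id
            self._next_id += 1
            self.jobs[job_id] = {"state": "Submitting", "submitted_at": now}
            ids.append(job_id)
        return ids

    def confirm_bulk(self, job_ids):
        """The "I've finished submitting jobs" call: release the jobs."""
        for job_id in job_ids:
            if self.jobs[job_id]["state"] == "Submitting":
                self.jobs[job_id]["state"] = "Received"

    def expire_unconfirmed(self, now):
        """Fail every job that was never confirmed within the timeout."""
        for job in self.jobs.values():
            age = now - job["submitted_at"]
            if job["state"] == "Submitting" and age >= self.confirm_timeout:
                job["state"] = "Failed"


manager = ToyJobManager()

# An old client submits but never sends the confirmation.
old_client_jobs = manager.submit_bulk(3, now=0.0)

# A current client submits and then confirms.
new_client_jobs = manager.submit_bulk(2, now=0.0)
manager.confirm_bulk(new_client_jobs)

# One hour later the server cleans up anything still unconfirmed.
manager.expire_unconfirmed(now=3600.0)
```

In this model the old client's jobs end up "Failed" while the confirmed ones stay "Received". Because unconfirmed jobs are never released, a client that lost its connection halfway through can resubmit the whole batch without creating duplicates.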
With this in mind, would it be possible to configure the GridPP DIRAC to not accept jobs from an out of date dirac_ui instance?
Unfortunately this isn't a feature that's available at the moment. We could try to add something in the dirac-proxy-init handshake or similar, but then we have the problem of knowing which releases are compatible or not. In general the compatibility is good as the developers do try to keep the interface as consistent as possible.
If this can be done it might help avoid strange problems like this in the future.
We generally recommend that people re-install the UI to the latest version as one of the first steps if they see strange behaviour. If that doesn't work, please ask on this list: there is normally someone here who has seen the same thing before, and otherwise we're always happy to investigate.

Regards,
Simon