Re: [Gridpp-Dirac-Users] Transient errors with DIRAC jobs
Hi Rohini,

Please always include the mailing list. While Simon and I administer the DIRAC instance, we don't actually use it, and other people might be better placed to answer your questions.

You reported that Job IDs 8920431, 8920312, 8907180 and 8897518 are failing with input data errors, even though you have confirmed that the input data does in fact exist and is accessible (locally, with dirac-dms-get-file).

This looks like a catalogue error. Unfortunately, when I search the logs for the first job I find:
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Single chosen site LCG.UKI-NORTHGRID-MAN-HEP.uk specified
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Site candidates are ['CLOUD.Datacentred.uk', 'VAC.UKI-LT2-UCL-HEP.uk', 'VAC.UKI-NORTHGRID-MAN-HEP.uk', 'LCG.UKI-NORTHGRID-MAN-HEP.uk']
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] No staging required
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Only site LCG.UKI-NORTHGRID-MAN-HEP.uk is candidate
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Done

As you can see, there is no error here, so I have nothing to go on. I really don't know what to do about this one; I will forward it to the DIRAC developers.

(Later it says:

runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 11:30:33 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Not in checking state. Avoid fast track

but even that is not an error.)

Does the error above disappear when you rerun the jobs?
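(The lines above come from a plain grep for the job ID over the Optimizer runit logs, along the lines of

  grep 'JID 8920431' runit/WorkloadManagement/Optimizers_1/log/@*.s

with the log directory being whatever the runit setup on the server uses.)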
Also, from time to time I have seen jobs fail with ApplicationStatus 'Cannot retrieve banned sites from JobDB' (most recently Job ID 8897033), and also with 'FileCatalog error ( 1604 : Failed to perform getReplicas from any catalog)', e.g. Job ID 8897076 (several from job group rohini.joshi.20180418103426). Therese has seen this problem too, with Job ID 8865042. These errors seem to be transient, and at times re-running jobs resolves the problem.
We assume this is a bug in DIRAC, as it has come up for other DIRAC instances as well. We've made various modifications to our DIRAC instance (mainly more of everything, as it looks a bit like a load/access problem), but we cannot reproduce it on demand, which makes debugging very hard. We'll keep looking.
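In the meantime, if you want to double-check the catalogue side of one of the failing files before resubmitting, something along these lines should work from a DIRAC UI (the LFN is a placeholder, and dirac-wms-job-reschedule is assumed to be available in your client version):

  # placeholder LFN -- substitute one of the failing input files
  LFN=/skatelescope.eu/some/path/to/input.file

  # ask the file catalogue which replicas it knows about
  dirac-dms-lfn-replicas $LFN

  # confirm the file is actually retrievable (as you already did locally)
  dirac-dms-get-file $LFN

  # if both succeed, the catalogue looks fine and rescheduling is worth a try
  dirac-wms-job-reschedule 8897076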
Just for some context: my jobs are uploading some data to RAL (in a lazy way) and are essentially just running a gfal-copy command to copy data from DIRAC storage at Manchester to RAL. Therese's job is trying to run a Singularity container on a Manchester GPU node.
Do you have a retry loop (with a sleep between retries) for your uploads?

@Therese: How do you target the GPU queue?

Sorry that I can't be more helpful at the moment.

Regards,
Daniela

--
Sent from the pit of despair
-----------------------------------------------------------
daniela.bauer@imperial.ac.uk
HEP Group/Physics Dep
Imperial College London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/
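For what it's worth, a minimal retry wrapper of the kind suggested above might look roughly like this (the source and destination URLs are placeholders; adjust the number of retries and the sleep to taste):

  #!/bin/bash
  # placeholder source and destination -- substitute the real Manchester and RAL URLs
  SRC="srm://source.example.ac.uk/dirac/skatelescope.eu/some/file"
  DST="srm://destination.example.ac.uk/dirac/skatelescope.eu/some/file"

  for attempt in 1 2 3 4 5; do
      if gfal-copy "$SRC" "$DST"; then
          echo "Copy succeeded on attempt $attempt"
          exit 0
      fi
      echo "Copy failed on attempt $attempt, sleeping before retry"
      sleep $((60 * attempt))
  done

  echo "All copy attempts failed" >&2
  exit 1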
Hi Rohini et al,

I had another look at job 8920431 and I now noticed that

Tags = "skatelescope.eu.gpu"

appears in your JDL. This tag is not set in the configuration system (and I am not aware of any requests to set it). I assume it should go with the ce01.tier2.hep.manchester.ac.uk CE and the nordugrid-Condor-gpu queue? Maybe Andrew can confirm? I don't think we have tested tags on ARC-CEs yet, so I don't know if it will work even if I set it.

The error message is a bit misleading, but I think what it is trying to tell you is that there is no place with both this tag and your data, and in its own way it is correct.

Regards,
Daniela

--
Sent from the pit of despair
-----------------------------------------------------------
daniela.bauer@imperial.ac.uk
HEP Group/Physics Dep
Imperial College London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/
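For reference, setting such a tag in the configuration system would look roughly like this; treat the exact option layout as a sketch (DIRAC supports a Tag option on CEs/queues, but the path below is not taken from the actual configuration):

  Resources
  {
    Sites
    {
      LCG
      {
        LCG.UKI-NORTHGRID-MAN-HEP.uk
        {
          CEs
          {
            ce01.tier2.hep.manchester.ac.uk
            {
              Queues
              {
                nordugrid-Condor-gpu
                {
                  Tag = skatelescope.eu.gpu
                }
              }
            }
          }
        }
      }
    }
  }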
Hi All,

Also, if I look at the GPU queue on ce01 at Manchester, it doesn't seem to support ska. I have to admit I have no idea how your job ever got into the state you found it in. I tried to replicate your JDL at Manchester using the gridpp VO, but my jobs never generate a pilot job, so all I can think of is that an already submitted pilot job picked your job up and then didn't know what to do with it.

I think what we need to do is:
- Manchester to enable ska on the GPU queue.
- We will then attach a Tag "gpu" (as far as I am aware this is the agreed Tag via LHCb) to it.
- Then you can try to resubmit (see the JDL sketch below).

Regards,
Daniela

--
Sent from the pit of despair
-----------------------------------------------------------
daniela.bauer@imperial.ac.uk
HEP Group/Physics Dep
Imperial College London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/
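Once the gpu tag is in place, the JDL should then only need the agreed tag; a minimal sketch (everything except the Tags line is a placeholder) would be:

  [
    Executable = "run_gpu_job.sh";
    Site = "LCG.UKI-NORTHGRID-MAN-HEP.uk";
    Tags = "gpu";
  ]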