Hi Rohini,

Please always include the mailing list. While Simon and I administer the DIRAC instance, we don't actually use it, and other people might be better placed to answer your questions.

Specifically, Job IDs 8920431, 8920312, 8907180, 8897518 are failing with Input data errors. However, we have confirmed that the input data does in fact exist and is accessible (locally with dirac-dms-get-file).

This looks like a catalogue error. Unfortunately, when I search the logs for the first job, I find:

runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Single chosen site LCG.UKI-NORTHGRID-MAN-HEP.uk specified
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Site candidates are ['CLOUD.Datacentred.uk', 'VAC.UKI-LT2-UCL-HEP.uk', 'VAC.UKI-NORTHGRID-MAN-HEP.uk', 'LCG.UKI-NORTHGRID-MAN-HEP.uk']
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] No staging required
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Only site LCG.UKI-NORTHGRID-MAN-HEP.uk is candidate
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Done

which, as you can see, contains no error, so I have nothing to go on. I really don't know what to do about this one, so I will forward it to the DIRAC developers.

(later it says:
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 11:30:33 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Not in checking state. Avoid fast track

but even that is not an error)

Does the error above disappear when you rerun the jobs?

Also, from time to time I have seen jobs fail with ApplicationStatus 'Cannot retrieve banned sites from JobDB' (most recently Job ID 8897033) and also 'FileCatalog error ( 1604 : Failed to perform getReplicas from any catalog)' (Job ID 8897076, and several others from job group rohini.joshi.20180418103426). Therese has seen this problem too, with Job ID 8865042.

These errors seem to be transient and at times re-running jobs resolves the problem.

We assume this is a bug in DIRAC, as it has come up on other DIRAC instances as well. We've made various modifications to our DIRAC instance (mainly adding more of everything, as it looks a bit like a load/access problem), but we cannot reproduce the error on demand, which makes debugging very hard. We'll keep looking.

Just for some context, my jobs are uploading some data to RAL (in a lazy way) and are essentially just running the gfal-copy command to upload data from DIRAC storage at Manchester to RAL. Therese's job is trying to run a singularity container on a Manchester GPU node.

Do you have a retry loop (with a sleep between retries) for your uploads?
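
In case it is useful, this is roughly the kind of retry loop I mean. It is only a sketch in Python, not something taken from your jobs: run_with_retry, its parameters and the delay values are names and numbers I made up for illustration, and the gfal-copy arguments in the comment are placeholders you would replace with your real source and destination URLs.

```python
import subprocess
import time


def run_with_retry(cmd, max_attempts=5, delay=60.0, backoff=2.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with status 0 or max_attempts is reached.

    Returns the (1-based) number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    All names and default values here are illustrative assumptions.
    """
    wait = delay
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Sleep before the next try, waiting longer each time so a
            # loaded service gets some room to recover.
            sleep(wait)
            wait *= backoff
    raise RuntimeError(f"{cmd[0]} failed after {max_attempts} attempts")


# Example call (placeholder arguments, not real URLs):
# run_with_retry(["gfal-copy", "<source-url>", "<destination-url>"])
```

The runner and sleep arguments are only there so the loop can be tried out without touching any storage. If the errors really are transient, a handful of attempts spread over several minutes should be enough to ride them out.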

@Therese: How do you target the GPU queue?

Sorry that I can't be more helpful at the moment.

Regards,

Daniela

Sent from the pit of despair

-----------------------------------------------------------
daniela.bauer@imperial.ac.uk
HEP Group/Physics Dep
Imperial College
London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/