Re: [Gridpp-Dirac-Users] Transient errors with DIRAC jobs
Hi Rohini,

Please always include the mailing list. While Simon and I administer the DIRAC instance, we don't actually use it, and other people might be better placed to answer your questions.

You reported that Job IDs 8920431, 8920312, 8907180 and 8897518 are failing with input data errors, even though you have confirmed that the input data does in fact exist and is accessible (locally, with dirac-dms-get-file).

This looks like a catalogue error. Unfortunately, when I search the logs for the first job I find:
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Single chosen site LCG.UKI-NORTHGRID-MAN-HEP.uk specified
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Site candidates are ['CLOUD.Datacentred.uk', 'VAC.UKI-LT2-UCL-HEP.uk', 'VAC.UKI-NORTHGRID-MAN-HEP.uk', 'LCG.UKI-NORTHGRID-MAN-HEP.uk']
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] No staging required
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Only site LCG.UKI-NORTHGRID-MAN-HEP.uk is candidate
runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 10:37:53 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Done

As you can see, there is no error here, so I have nothing to go on. I really don't know what to do about this one; I will forward it to the DIRAC developers.

(Later it says:

runit/WorkloadManagement/Optimizers_1/log/@400000005ad9a5a434a9aaa4.s:2018-04-19 11:30:33 UTC WorkloadManagement/Optimizers_1/WorkloadManagement/JobScheduling INFO: [JID 8920431] Not in checking state. Avoid fast track

but even that is not an error.)

Does the error above disappear when you rerun the jobs?
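(The lines above come from a plain grep for the job ID over the Optimizer runit logs, along the lines of

  grep 'JID 8920431' runit/WorkloadManagement/Optimizers_1/log/@*.s

with the log directory being whatever the runit setup on the server uses.)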
Also, from time to time I have seen jobs fail with ApplicationStatus 'Cannot retrieve banned sites from JobDB' (most recently Job ID 8897033), and also with 'FileCatalog error ( 1604 : Failed to perform getReplicas from any catalog)', e.g. Job ID 8897076 (several from job group rohini.joshi.20180418103426). Therese has seen this problem too, with Job ID 8865042. These errors seem to be transient, and at times re-running jobs resolves the problem.
We assume this is a bug in DIRAC, as it has come up for other DIRAC instances as well. We've made various modifications to our DIRAC instance (mainly more of everything, as it looks a bit like a load/access problem), but we cannot reproduce it on demand, which makes debugging very hard. We'll keep looking.
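In the meantime, if you want to double-check the catalogue side of one of the failing files before resubmitting, something along these lines should work from a DIRAC UI (the LFN is a placeholder, and dirac-wms-job-reschedule is assumed to be available in your client version):

  # placeholder LFN -- substitute one of the failing input files
  LFN=/skatelescope.eu/some/path/to/input.file

  # ask the file catalogue which replicas it knows about
  dirac-dms-lfn-replicas $LFN

  # confirm the file is actually retrievable (as you already did locally)
  dirac-dms-get-file $LFN

  # if both succeed, the catalogue looks fine and rescheduling is worth a try
  dirac-wms-job-reschedule 8897076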
Just for some context: my jobs are uploading some data to RAL (in a lazy way) and are essentially just running a gfal-copy command to copy data from DIRAC storage at Manchester to RAL. Therese's job is trying to run a Singularity container on a Manchester GPU node.
Do you have a retry loop (with a sleep between retries) for your uploads?

@Therese: How do you target the GPU queue?

Sorry that I can't be more helpful at the moment.

Regards,
Daniela

--
Sent from the pit of despair
-----------------------------------------------------------
daniela.bauer@imperial.ac.uk
HEP Group/Physics Dep
Imperial College London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/
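For what it's worth, a minimal retry wrapper of the kind suggested above might look roughly like this (the source and destination URLs are placeholders; adjust the number of retries and the sleep to taste):

  #!/bin/bash
  # placeholder source and destination -- substitute the real Manchester and RAL URLs
  SRC="srm://source.example.ac.uk/dirac/skatelescope.eu/some/file"
  DST="srm://destination.example.ac.uk/dirac/skatelescope.eu/some/file"

  for attempt in 1 2 3 4 5; do
      if gfal-copy "$SRC" "$DST"; then
          echo "Copy succeeded on attempt $attempt"
          exit 0
      fi
      echo "Copy failed on attempt $attempt, sleeping before retry"
      sleep $((60 * attempt))
  done

  echo "All copy attempts failed" >&2
  exit 1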
Hi Rohini et al,

I had another look at job 8920431 and I now noticed that

Tags = "skatelescope.eu.gpu"

appears in your JDL. This tag is not set in the configuration system (and I am not aware of any requests to set it). I assume it should go with the ce01.tier2.hep.manchester.ac.uk CE and the nordugrid-Condor-gpu queue? Maybe Andrew can confirm? I don't think we have tested tags on ARC-CEs yet, so I don't know if it will work even if I set it.

The error message is a bit misleading, but I think what it is trying to tell you is that there is no place with both this tag and your data, and in its own way it is correct.

Regards,
Daniela

--
Sent from the pit of despair
-----------------------------------------------------------
daniela.bauer@imperial.ac.uk
HEP Group/Physics Dep
Imperial College London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/
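For reference, setting such a tag in the configuration system would look roughly like this; treat the exact option layout as a sketch (DIRAC supports a Tag option on CEs/queues, but the path below is not taken from the actual configuration):

  Resources
  {
    Sites
    {
      LCG
      {
        LCG.UKI-NORTHGRID-MAN-HEP.uk
        {
          CEs
          {
            ce01.tier2.hep.manchester.ac.uk
            {
              Queues
              {
                nordugrid-Condor-gpu
                {
                  Tag = skatelescope.eu.gpu
                }
              }
            }
          }
        }
      }
    }
  }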
Hi All,

Also, if I look at the GPU queue on ce01 at Manchester, it doesn't seem to support ska. I have to admit I have no idea how your job ever got into the state you found it in. I tried to replicate your JDL at Manchester using the gridpp VO, but my jobs never generate a pilot job, so all I can think of is that an already submitted pilot job picked your job up and then didn't know what to do with it.

I think what we need to do is:
- Manchester to enable ska on the GPU queue.
- We will then attach a Tag "gpu" (as far as I am aware this is the agreed Tag via LHCb) to it.
- Then you can try to resubmit (see the JDL sketch below).

Regards,
Daniela

--
Sent from the pit of despair
-----------------------------------------------------------
daniela.bauer@imperial.ac.uk
HEP Group/Physics Dep
Imperial College London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/
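Once the gpu tag is in place, the JDL should then only need the agreed tag; a minimal sketch (everything except the Tags line is a placeholder) would be:

  [
    Executable = "run_gpu_job.sh";
    Site = "LCG.UKI-NORTHGRID-MAN-HEP.uk";
    Tags = "gpu";
  ]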