Hi,
I have job 18116468 submitted on behalf of DUNE. It keeps complaining that it is unable to retrieve the proxy, though DIRAC certainly has a valid proxy of mine uploaded - other jobs are running fine.
Is there any way for me to understand what is happening here?
Thanks and Kind Regards, Raja.

Hi Raja,
I think this is normally caused by the pilot having trouble at a specific site (in this case VAC.UKI-SOUTHGRID-BHAM-HEP.uk). Could you please try submitting a couple of jobs targeting specific sites (including that one) and see if the problems occur elsewhere?
I'd normally look in the pilot log for this, but I can't work out the correct URL to get the log for that job (if it's been uploaded at all).
Regards, Simon
(In reply to Raja Nandakumar's message of Fri, Jul 26, 2019 at 04:42:25PM +0100.)

Hi Simon,
So far in my various tests, only BHAM-HEP has had this issue. Could you let me know which sites you would like me to target, and I will try.
Cheers, Raja.
(In reply to Simon Fayer's message of 26/07/19 17:02.)

Hiya,
I'm happy to look into this from the site point of view - Raja: Could you point me at a typical (and fairly recent!) job ID and worker node and I'll have a look at this end. I'm not aware of any general issues but it sounds like I'm doing something wrong!
Thanks,
Mark
(In reply to the message from raja.nandakumar@cern.ch of 29/07/2019 10:50.)

Hi Mark,
Given that it is a VAC site, I am not able to get the pilot information for the job. Also, the job parameters are not filled in, as the job failed early on. So I am not able to give you a host name either.
The only information I have is the DIRAC job ID: 18187968. Maybe Simon has more karma about it?
Cheers, Raja.
(In reply to Mark Slater's message of 29/07/19 10:55.)

Hi Mark,
On Mon, Jul 29, 2019 at 12:19:10PM +0100, Raja Nandakumar wrote:
> The only information I have is the DIRAC job ID : 18187968 Maybe Simon has more karma about it?
This job retried a few times and did eventually run. I think I've found the reference from a previous failed run in the matcher log; perhaps that will give some clues?
vm://vac.ph.bham.ac.uk/vac.ph.bham.ac.uk:94866fc5-ccd6-43b5-b778-bb32a880b7c6:gds-vm-dune
Regards, Simon

Hi Raja,
On Mon, Jul 29, 2019 at 10:50:41AM +0100, Raja Nandakumar wrote:
> So far in my various tests, it has been BHAM-HEP which has had this issue. Could you let me know which sites you would like me to target and I will try.
It's increasingly sounding like this is a site-specific problem, so it's probably not worth putting too much effort in to test. I'd just keep an eye out for any similar "Unable to retrieve proxy" errors for now (let us know if you spot any).
If you really want to run a test, I'd suggest targeting sites that have run the largest amounts of DUNE work in the last month from the EGI accounting portal: If the normal DUNE glideinWMS jobs are running there, then DIRAC jobs should also be successful there too.
Regards, Simon

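As a rough illustration of the site-by-site test suggested above, the sketch below builds one trivial JDL description per target site, pinned with the `Site` attribute. The helper name `make_site_jdl`, the job name and the `/bin/hostname` payload are made up for illustration; the two sites are ones named in this thread. It only generates text and does not submit anything.

```python
# Sketch only: generate one minimal JDL per site so that failures can be
# compared site by site. Names below (make_site_jdl, "proxy-test") are
# hypothetical; adapt the payload and attributes to your own setup.

def make_site_jdl(site, job_name="proxy-test"):
    """Return JDL text for a trivial job pinned to a single site."""
    lines = [
        f'JobName = "{job_name}-{site}";',
        'Executable = "/bin/hostname";',      # harmless payload, assumption
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
        f'Site = "{site}";',                  # pins the job to this site
    ]
    return "\n".join(lines) + "\n"


# Sites taken from this thread; extend the list as needed.
SITES = ["VAC.UKI-SOUTHGRID-BHAM-HEP.uk", "LCG.SARA-MATRIX.nl"]

jdls = {site: make_site_jdl(site) for site in SITES}

if __name__ == "__main__":
    for site, text in jdls.items():
        print(f"# --- {site} ---")
        print(text)
```

Each string could be written to a file and submitted with whatever client is normally used (for example `dirac-wms-job-submit`), assuming the chosen proxy group is allowed to run at those sites.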
Hi Simon,
It looks like LCG.SARA-MATRIX.nl also has this problem, though after enough reschedulings the jobs eventually run there.
Regards, Raja.
(In reply to Simon Fayer's message of 29/07/19 21:30.)

Hi Raja,
Hmm, it's failing to add the VOMS extension onto the proxy for some reason (full error below). I can't see anything wrong with this other than the slightly dubious -vomses option on the command (it points to /opt/dirac rather than the job work dir).
We'll look at running some test jobs in the next couple of days to try and work out what's going on there.
Regards, Simon
2019-08-01 13:13:32 UTC WorkloadManagement/JobAgent ERROR: Could not retrieve payload proxy Cannot append voms extension: VOMS Error ( 1121 : Failed to set VOMS attributes. Command: voms-proxy-init2 -cert "/tmp/tmppr6baQ" -key "/tmp/tmppr6baQ" -out "/tmp/tmp0KDf2R" -voms "dune:/dune, /dune/Role=Analysis" -valid "5901:38" -vomses "/opt/dirac/etc/grid-security/vomses" -r -timeout 12;
StdOut: Your identity: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=nraja/CN=471708/CN=Raja Nandakumar/CN=4095007602
Creating temporary proxy Done
Contacting voms1.fnal.gov:15042 [/DC=org/DC=incommon/C=US/ST=IL/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=voms1.fnal.gov] "dune" Done
Creating proxy Done
Your proxy is valid until Fri Apr 3 11:51:32 2020 ;
StdErr: ..........................................................................
Warning: voms1.fnal.gov:15042: The validity of this VOMS AC in your proxy is shortened to 86400 seconds!
...................................................Error: verification failed. Cannot verify AC signature! )
(In reply to Raja Nandakumar's message of Thu, Aug 01, 2019 at 02:55:30PM +0100.)

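When many such JobAgent errors need comparing across sites, it can help to pull out just the command options (such as the -vomses path) and the final error line. The following is a minimal sketch, assuming the message keeps the "Command: ...; StdOut: ...; StdErr: ..." layout seen in the error above; the function name `summarise_voms_error` is hypothetical and the sample string is an abridged copy of that error, not new log output.

```python
import re
import shlex


def summarise_voms_error(message):
    """Extract the command, its options and the last 'Error:' text.

    Assumes the 'Command: ...; StdOut:' layout of the error in this thread.
    """
    summary = {"command": None, "options": {}, "error": None}

    cmd = re.search(r"Command:\s*(.*?);\s*StdOut:", message, re.S)
    if cmd:
        tokens = shlex.split(cmd.group(1))
        summary["command"] = tokens[0] if tokens else None
        # Pair each "-flag" with the following token unless that is a flag too.
        for i, tok in enumerate(tokens):
            if tok.startswith("-"):
                nxt = tokens[i + 1] if i + 1 < len(tokens) else None
                has_value = nxt is not None and not nxt.startswith("-")
                summary["options"][tok] = nxt if has_value else None

    # Case-sensitive on purpose: skips the agent's own "ERROR:" prefix.
    errors = re.findall(r"Error:\s*([^\n)]+)", message)
    if errors:
        summary["error"] = errors[-1].strip()
    return summary


# Abridged from the error quoted in this thread (StdOut/StdErr shortened).
SAMPLE = (
    "Cannot append voms extension: VOMS Error ( 1121 : Failed to set VOMS "
    'attributes. Command: voms-proxy-init2 -cert "/tmp/tmppr6baQ" '
    '-key "/tmp/tmppr6baQ" -out "/tmp/tmp0KDf2R" '
    '-voms "dune:/dune, /dune/Role=Analysis" -valid "5901:38" '
    '-vomses "/opt/dirac/etc/grid-security/vomses" -r -timeout 12; '
    "StdOut: Creating proxy Done ; "
    "StdErr: Error: verification failed. Cannot verify AC signature! )"
)

if __name__ == "__main__":
    result = summarise_voms_error(SAMPLE)
    print(result["options"]["-vomses"])
    print(result["error"])
```

Comparing the extracted -vomses path and error text between a failing and a succeeding pilot is one way to see whether the option Simon flags actually differs between sites.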
Hi Simon,
Thanks. And now it is happening at Imperial too.
Regards, Raja.
(In reply to Simon Fayer's message of 01/08/19 18:14.)

Hi Raja,
The ones failing at Imperial have a subtly different proxy error in the pilot log: It looks like they were submitted with the dune_production group, but you don't have a dune_production proxy uploaded. If that was a deliberate choice, please run "dirac-proxy-init -g dune_production -U" on your UI.
Regards, Simon
(In reply to Raja Nandakumar's message of Fri, Aug 02, 2019 at 04:26:35PM +0100.)

Hi Simon,
I have no idea how they got into the dune_production group, as I have submitted everything with exactly the same proxy so far. In any case, I have now uploaded a production proxy as you suggested.
Cheers, Raja.
(In reply to Simon Fayer's message of 02/08/19 19:47.)

Hi All,
Apologies for being radio silent for the past couple of weeks - school holidays have been sucking up all my time!
Is this still a problem? I can see DUNE jobs have run recently. It sounds like it's not just a Bham thing, but I'm quite happy to help debug if I can! It would help if you could let me know the DIRAC ID and worker node of a recent failed job, and I can have a closer look...
Thanks!
Mark
(In reply to the message from raja.nandakumar@cern.ch of 02/08/2019 16:26.)

participants (3)
- Mark Slater
- Raja Nandakumar
- Simon Fayer