Re: [Gridpp-Dirac-Users] Job 1864642: No space left on device
Hi Dan,
I'm cc'ing the list as this concerns all users.
What happened is that the sandbox space on dirac was full. We've just never seen that much use of our dirac instance before. We increased the space as soon as our nagios monitoring picked this up, but I guess not quickly enough for all your jobs.
Having said this, generally small sandboxes make for more efficient job submission/retrieval.
I had a quick look in your sandbox and you seem to be shipping the same piece of software with each job. It's small, but it adds up, so this would probably be better located in cvmfs. As for large stdout/stderr, this can trip up dirac. If you (general 'you') expect large stdouts it's best to write them to a log file and then ship the log file back to the SE at the end of your job, together with your results, rather than try to go via a sandbox.
Regards,
Daniela
On 28 November 2016 at 19:06, Dan Protopopescu <dan.protopopescu@glasgow.ac.uk> wrote:
Hi Daniela,
Can this be because the stdout and stderr from these jobs are immense?
2016-11-28 18:11:08 UTC Wrapper_1864642 INFO: Attempting to upload Sandbox with limit: 20971520
2016-11-28 18:11:08 UTC Wrapper_1864642 ERROR: Output sandbox upload failed with message Server error while serving fromClient: OSError(28, 'No space left on device')
2016-11-28 18:11:08 UTC Wrapper_1864642 INFO: Attempting to upload /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7 as output data
2016-11-28 18:11:08 UTC Wrapper_1864642 INFO: Output data files /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7 to be uploaded to ['GridPPSandboxSE'] SE
2016-11-28 18:11:08 UTC Wrapper_1864642 INFO: GUIDs not found from POOL XML Catalogue (and were generated) for: /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7
2016-11-28 18:11:08 UTC Wrapper_1864642/FailoverTransfer INFO: Attempting dm.putAndRegister('/na62.vo.gridpp.ac.uk/user/r/robotgridclient1/1864/1864642/LDSB.40mSP7','/srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7','GridPPSandboxSE',guid='7A0FBA5C-05BE-5295-AB24-00559804301D',catalog='[]')
2016-11-28 18:11:09 UTC Wrapper_1864642/FailoverTransfer ERROR: dm.putAndRegister failed with message Failed to put file to Storage Element. Server error while serving fromClient: OSError(28, 'No space left on device')
2016-11-28 18:11:09 UTC Wrapper_1864642/FailoverTransfer ERROR: Failed to upload output data file Encountered 1 errors
2016-11-28 18:11:09 UTC Wrapper_1864642 ERROR: Could not putAndRegister file /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7 with LFN /na62.vo.gridpp.ac.uk/user/r/robotgridclient1/1864/1864642/LDSB.40mSP7 to GridPPSandboxSE with GUID 7A0FBA5C-05BE-5295-AB24-00559804301D trying failover storage
2016-11-28 18:11:09 UTC Wrapper_1864642 INFO: No failover SEs defined for JobWrapper, cannot try to upload output file /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7 anywhere else.
2016-11-28 18:11:09 UTC Wrapper_1864642 WARN: Failed to upload OutputData
2016-11-28 18:11:09 UTC Wrapper_1864642 EXCEPT: JobWrapper failed to process output files
2016-11-28 18:11:09 UTC Wrapper_1864642 EXCEPT: == EXCEPTION == JobWrapperError
Best regards, Dan
-- Dan PROTOPOPESCU, University of Glasgow, UK W:+44(0)141-330-4197 M:+44(0)794-046-3355 http://ppewww.physics.gla.ac.uk/~protopop/
--
Sent from the pit of despair
-----------------------------------------------------------
daniela.bauer@imperial.ac.uk
HEP Group/Physics Dep
Imperial College London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/
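The advice above about large stdout/stderr can be sketched as a small wrapper that sends the payload's output to a log file, which can then be shipped to the SE together with the results. This is only a minimal sketch using the Python standard library: the function names and the log file name are hypothetical, and the only figure taken from the thread is the sandbox limit of 20971520 bytes reported by the job wrapper. How the log file is then declared as output data depends on your job description and is not shown here.

```python
import subprocess
import sys
from pathlib import Path

# Limit reported by the job wrapper in the log above:
# "Attempting to upload Sandbox with limit: 20971520" (20 MiB).
SANDBOX_LIMIT_BYTES = 20971520


def run_with_log(command, log_path):
    """Run `command`, writing its stdout and stderr to `log_path`.

    Returns (return code, size of the log file in bytes).
    The function name and signature are illustrative, not part of DIRAC.
    """
    log_path = Path(log_path)
    with log_path.open("wb") as log_file:
        # stderr is merged into stdout so that one file holds all output.
        result = subprocess.run(command, stdout=log_file, stderr=subprocess.STDOUT)
    return result.returncode, log_path.stat().st_size


def fits_in_sandbox(size_bytes, limit=SANDBOX_LIMIT_BYTES):
    """Return True if a file of `size_bytes` is within the sandbox limit."""
    return size_bytes <= limit
```

A job script could call `run_with_log([...], "payload.log")` around its real payload and use `fits_in_sandbox` as a quick check of whether the log is small enough for the sandbox or should go to the SE instead.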
Hi Dan,
If your software is stable, I cannot recommend CVMFS enough, not least because you can also use it for local and batch system jobs where /cvmfs is available. If you are concerned about your software being world-readable, Catalin (CCed) informs me that we could implement the new secure CVMFS (which they've got working in the US now).
I can also add a new CVMFS section to the UserGuide [1] if there's enough demand. I've got a EUCLID user who's nearly there, so the more the merrier.
Thanks,
Tom
[1] http://www.gridpp.ac.uk/userguide
On Tue, 29 Nov 2016 at 11:44 Daniela Bauer <daniela.bauer.grid@googlemail.com> wrote:
Hi Dan,
I'm cc'ing the list as this concerns all users.
What happened is that the sandbox space on dirac was full. We've just never seen that much use of our dirac instance before. We increased the space as soon as our nagios monitoring picked this up, but I guess not quickly enough for all your jobs.
Having said this, generally small sandboxes make for more efficient job submission/retrieval.
I had a quick look in your sandbox and you seem to be shipping the same piece of software with each job. It's small, but it adds up, so this would probably be better located in cvmfs. As for large stdout/stderr, this can trip up dirac. If you (general 'you') expect large stdouts it's best to write them to a log file and then ship the log file back to the SE at the end of your job, together with your results, rather than try to go via a sandbox.
Regards, Daniela
On 28 November 2016 at 19:06, Dan Protopopescu < dan.protopopescu@glasgow.ac.uk> wrote:
Hi Daniela,
Can this be because the stdout and stderr from these jobs are immense?
2016-11-28 18:11:08 UTC Wrapper_1864642 INFO: Attempting to upload Sandbox with limit: 20971520
2016-11-28 18:11:08 UTC Wrapper_1864642 ERROR: Output sandbox upload failed with message Server error while serving fromClient: OSError(28, 'No space left on device')
2016-11-28 18:11:08 UTC Wrapper_1864642 INFO: Attempting to upload /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7 as output data
2016-11-28 18:11:08 UTC Wrapper_1864642 INFO: Output data files /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7 to be uploaded to ['GridPPSandboxSE'] SE
2016-11-28 18:11:08 UTC Wrapper_1864642 INFO: GUIDs not found from POOL XML Catalogue (and were generated) for: /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7
2016-11-28 18:11:08 UTC Wrapper_1864642/FailoverTransfer INFO: Attempting dm.putAndRegister('/na62.vo.gridpp.ac.uk/user/r/robotgridclient1/1864/1864642/LDSB.40mSP7','/srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7','GridPPSandboxSE',guid='7A0FBA5C-05BE-5295-AB24-00559804301D',catalog='[]')
2016-11-28 18:11:09 UTC Wrapper_1864642/FailoverTransfer ERROR: dm.putAndRegister failed with message Failed to put file to Storage Element. Server error while serving fromClient: OSError(28, 'No space left on device')
2016-11-28 18:11:09 UTC Wrapper_1864642/FailoverTransfer ERROR: Failed to upload output data file Encountered 1 errors
2016-11-28 18:11:09 UTC Wrapper_1864642 ERROR: Could not putAndRegister file /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7 with LFN /na62.vo.gridpp.ac.uk/user/r/robotgridclient1/1864/1864642/LDSB.40mSP7 to GridPPSandboxSE with GUID 7A0FBA5C-05BE-5295-AB24-00559804301D trying failover storage
2016-11-28 18:11:09 UTC Wrapper_1864642 INFO: No failover SEs defined for JobWrapper, cannot try to upload output file /srv/localstage/scratch/284893.1.grid.q/LDSB.40mSP7 anywhere else.
2016-11-28 18:11:09 UTC Wrapper_1864642 WARN: Failed to upload OutputData
2016-11-28 18:11:09 UTC Wrapper_1864642 EXCEPT: JobWrapper failed to process output files
2016-11-28 18:11:09 UTC Wrapper_1864642 EXCEPT: == EXCEPTION == JobWrapperError
Best regards, Dan
-- Dan PROTOPOPESCU, University of Glasgow, UK W:+44(0)141-330-4197 M:+44(0)794-046-3355 http://ppewww.physics.gla.ac.uk/~protopop/
--
Sent from the pit of despair
-----------------------------------------------------------
daniela.bauer@imperial.ac.uk
HEP Group/Physics Dep
Imperial College London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/
--
_______________________________________________
Gridpp-Dirac-Users mailing list
Gridpp-Dirac-Users@imperial.ac.uk
https://mailman.ic.ac.uk/mailman/listinfo/gridpp-dirac-users
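Tom's point that software under /cvmfs can serve grid, local and batch jobs alike could look roughly like the sketch below: a job prefers the CVMFS copy when the directory is present and otherwise falls back to a copy shipped with the job. This is a minimal illustration, not an established GridPP recipe. The function name and both paths in the usage note are hypothetical, and no real repository name is implied.

```python
from pathlib import Path


def find_software(cvmfs_dir, fallback_dir):
    """Return `cvmfs_dir` if it exists as a directory, else `fallback_dir`.

    Both arguments are plain paths chosen by the caller; nothing here is
    specific to a particular CVMFS repository.
    """
    cvmfs_dir = Path(cvmfs_dir)
    if cvmfs_dir.is_dir():
        return cvmfs_dir
    # CVMFS is not available on this node: use the copy shipped with the job.
    return Path(fallback_dir)
```

For example, `find_software("/cvmfs/<your-repository>/mysoftware", "./mysoftware")` would pick the shared installation on worker nodes that provide /cvmfs and the local directory elsewhere, so the same script runs in both places.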
participants (2)
- Daniela Bauer
- Tom Whyntie