EPoll: Bad file descriptor polling for events
Hello,
In the last few days I have noticed that a small fraction of my jobs (roughly 1 in 1000, though some sites seem more susceptible than others) fail instantly with the error:
EPoll: Bad file descriptor polling for events
(this seems to happen after <1s of CPU time)
The only thing I have changed in my jobs since this started is that I now use the feature where you can specify LFN:/your/file in the inputSandbox (previously I was manually issuing a download command inside the job).
To simplify the situation, I made a test job that has the LFN of a text file in the inputSandbox and then just 'cat's out the content. Repeating this job a few times at IN2P3 (where I had seen the error most frequently, although it has happened at other sites too), I managed to reproduce the error, e.g. DIRAC JOB ID: 29573283
I ran some test jobs without the LFN in the inputSandbox and they all ran fine (though this was a small sample, so I can't really conclude anything from that).
So it seems likely that the problem is linked to my use of inputSandbox to download files, but it only rarely causes an issue. Is this a known thing? Am I doing something wrong? Should I be using inputSandbox in this way?
Cheers,
Sophie
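For readers trying to reproduce this, the test job described above might look roughly like the following. This is only a sketch under assumptions: the helper name build_test_jdl and the job name are hypothetical, LFN:/your/file is the placeholder path from the message, and the JDL attributes shown (Executable, Arguments, InputSandbox, OutputSandbox, StdOutput, StdError) are the commonly used DIRAC ones rather than the exact job that was submitted. The script only generates the JDL text; it does not talk to DIRAC.

```python
import posixpath


def build_test_jdl(lfn: str) -> str:
    """Return JDL text for a job that stages `lfn` via InputSandbox and cats it.

    `lfn` is the logical file name without the "LFN:" prefix,
    e.g. "/your/file" (placeholder path, not a real file).
    """
    # Assumption: the file appears in the job's working directory
    # under its base name once the input sandbox has been staged.
    local_name = posixpath.basename(lfn)
    lines = [
        'JobName = "lfn_sandbox_test";',  # hypothetical job name
        'Executable = "/bin/cat";',
        f'Arguments = "{local_name}";',
        f'InputSandbox = {{"LFN:{lfn}"}};',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the JDL so it can be saved to a file and submitted by hand.
    print(build_test_jdl("/your/file"))
```

Saving the printed text to a file and submitting it repeatedly to one site is one way to check how often the EPoll failure appears there.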
Hi Sophie,
It seems this is a bug in xrootd that's triggered by the DIRAC configuration. If xrootd is used directly within the DIRAC pilot (i.e. staging the data in with InputData), then there is a small chance that it will fail to launch the user script with the epoll error you see.
Unfortunately the core DIRAC team don't seem to have agreed on a workaround yet:
https://github.com/DIRACGrid/DIRAC/issues/4616
They also filed an upstream bug with xrootd for a proper fix, but that also hasn't gone anywhere:
https://github.com/xrootd/xrootd/issues/1198
We'll contact the DIRAC people early next week (which may be particularly easy as it's the DIRAC user workshop) and see if we can get a workaround included in the code and/or deployed on the GridPP instance as soon as possible.
Regards,
Simon
On Fri, May 07, 2021 at 06:08:19PM +0100, Sophie King wrote:
Hello
In the last few days I have noticed a small fraction of my jobs start to fail instantly with the error: (probably ~ 1/1000, though some sites maybe seem more susceptible than others)
EPoll: Bad file descriptor polling for events
(seems to be after <1s CPU time)
The only thing I have changed in my jobs since this started to happen is that I now use the feature where you can specify LFN:/your/file in the inputSandbox (previously I was just manually issuing a download command inside the job).
To simplify the situation, I made a test job that has the LFN of a text file in the inputSandbox and then just 'cat's out the content. Repeating this job a few times at IN2P3 (where I had seen this happen the most frequently, but it has happened at other sites too), I managed to bump into the error.
e.g. DIRAC JOB ID: 29573283
I ran some test jobs without the LFN in the inputSandbox and they all ran fine (though this was a small sample, so I can't really conclude anything from that).
So it seems likely it is linked to my use of inputSandbox to download files, but it is relatively rare that it actually causes an issue. Is this a known thing? Am I doing something wrong? Should I be using inputSandbox in this way?
Cheers Sophie
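The small-sample caveat in the quoted message can be made concrete with a short calculation. The sketch below assumes failures are independent and uses the rough 1/1000 rate estimated in the message; the sample size of 20 jobs is a hypothetical figure, since the thread does not say how many test jobs were run, and the function names are illustrative only.

```python
import math


def prob_no_failures(p_fail: float, n_jobs: int) -> float:
    """Probability that n_jobs independent jobs all succeed."""
    return (1.0 - p_fail) ** n_jobs


def jobs_needed(p_fail: float, confidence: float) -> int:
    """Smallest number of jobs for which at least one failure is
    expected with the given confidence, assuming independence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    p = 1.0 / 1000.0  # rough failure rate estimated in the message
    n = 20            # hypothetical size of the "small sample"
    print(f"P(no failures in {n} jobs) = {prob_no_failures(p, n):.3f}")
    print(f"Jobs needed for a 95% chance of a failure: {jobs_needed(p, 0.95)}")
```

At a failure rate this low, a clean run of a few tens of jobs is almost guaranteed either way, so it takes on the order of thousands of jobs before a clean run says much about whether the inputSandbox LFN is involved.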
participants (2)
- Simon Fayer
- Sophie King