Hi Aditya,

The "pilot not running" error indicates that the site killed the job. In these instances we don't get any further feedback from the site (short of contacting the admins to ask them what the local batch system logged).

The most common causes are:
- Jobs using too much memory (more than 2-4GB); this is by far the most common cause.
- Jobs that don't use enough CPU time for a given amount of real time.
- Jobs that use too much CPU, i.e. multithreaded jobs that only request a single CPU slot.

I suggest that you run one of the failed jobs locally using /usr/bin/time, top or a similar tool that can track resource usage, and see if you can work out which of the above possibilities applies.

Regards,
Simon

On Wed, Oct 27, 2021 at 11:29:04AM +0000, Aditya Upreti wrote:
Hi Daniela,
Thanks for all your help thus far; I am able to retrieve the output files from the storage elements. I wanted to bring another issue to your attention (sorry to bother you again!).
I recently submitted several jobs (~1000) to the GRID. A few of them are running, but many show the status failed with the message "Job stalled: pilot not running". Would you know why this is happening? Could it be because of some allotted quota per user? If so, how can I increase the quota, as I will need to run a large number of jobs simultaneously?
Thank you so much.
Best Regards
Aditya Upreti | Ph.D. Candidate / Research Assistant
Department of Physics
The University of Alabama<https://www.ua.edu/>
102 Gallalee Hall
Tuscaloosa, AL 35404
Phone 205-886-2914
aupreti@crimson.ua.edu
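
For the local check suggested in the reply (running a failed job under a tool that tracks resource usage), a minimal Python sketch is given here as one possible approach. It is not part of the original thread. The function name profile_command and the returned field names are hypothetical, it assumes a Linux machine where ru_maxrss is reported in kilobytes, and the limits a given site actually enforces are not known from this exchange.

```python
import resource
import subprocess
import sys
import time


def profile_command(argv):
    """Run argv as a child process and summarise its resource usage."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    returncode = subprocess.run(argv).returncode
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # User + system CPU time consumed by the child (and its waited-for children).
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    # ru_maxrss is a high-water mark over all children waited for so far.
    # Assumption: Linux, where the unit is kilobytes (macOS reports bytes).
    peak_rss_mb = after.ru_maxrss / 1024.0
    return {
        "returncode": returncode,
        "wall_s": wall,
        "cpu_s": cpu,
        "cpu_per_wall": cpu / wall if wall > 0 else 0.0,
        "peak_rss_mb": peak_rss_mb,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python profile_job.py <job command and arguments>
    for key, value in profile_command(sys.argv[1:]).items():
        print(f"{key}: {value}")
```

Reading the result against the three causes listed in the reply: a peak_rss_mb in the 2000-4000 range or above points to memory, a cpu_per_wall well below 1 points to too little CPU time for the real time used, and a cpu_per_wall well above 1 points to a multithreaded job running in a single CPU slot.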