jobs in submitted state for more than a week
Hi,

I am trying to run a MoEDAL simulation with Ganga. I initially submitted 1080 jobs. While 900 of them were processed correctly and produced the desired output, the remaining 180 were stuck on the grid (with the status 'running') for more than 3 weeks. Last week, I killed those jobs and resubmitted them to the grid. Since then, all my jobs have been in the submitted state; not a single one has started to run.

This is the script used to submit the jobs: https://github.com/asantra/ScriptsForDiracGrid/blob/master/make_dirac_lhe_jo...
This is the bash script: https://github.com/asantra/ScriptsForDiracGrid/blob/master/run_lhe_v49r8.sh

Can you look into this issue? Please let me know if you need more details to investigate.

Thank you,
Arka

--
Arka Santra
Post Doctoral Researcher
Instituto de Fisica Corpuscular
Parque Cientifico
C/Catedratico Jose Beltran, 2
E-46980, Paterna
Espana
Hi Arka,

I cannot see any jobs of yours in a state other than "Done" in our DIRAC instance. You should be able to check this yourself by going to https://dirac.gridpp.ac.uk:8443/DIRAC (with a certificate in your browser), then clicking "JobMonitor" on the left, followed by "Submit" (it should fill in your username automatically, as it derives it from your certificate).

I'm going to venture that this is a Ganga problem. Anything interesting in the logs on your side?

regards,
Daniela

On 10 May 2018 at 16:31, Arka Santra <santra.arka@gmail.com> wrote:
This is the script used to submit the jobs: https://github.com/asantra/ScriptsForDiracGrid/blob/master/make_dirac_lhe_job.py
This is the bash script: https://github.com/asantra/ScriptsForDiracGrid/blob/master/run_lhe_v49r8.sh
-- _______________________________________________ Gridpp-Dirac-Users mailing list Gridpp-Dirac-Users@imperial.ac.uk https://mailman.ic.ac.uk/mailman/listinfo/gridpp-dirac-users
-- Sent from the pit of despair ----------------------------------------------------------- daniela.bauer@imperial.ac.uk HEP Group/Physics Dep Imperial College London, SW7 2BW Tel: +44-(0)20-75947810 http://www.hep.ph.ic.ac.uk/~dbauer/
Hi all,

There are some known issues with Ganga monitoring jobs stuck in a transient state. If you have jobs in the `submitting` or `completing` state that remain stuck there across Ganga restarts, you should fail them, as Ganga will fail to monitor these (sub)jobs correctly:

    jobs(x).subjobs(y).force_status('failed')

I would recommend setting the following in your .gangarc:

    [Registry]
    DisableLoadCheck=False

This will forcibly fail jobs which are in a bad state when Ganga starts up. (Hopefully this will be the default at some point in the future.)

Rob

On 10/05/2018 04:42 PM, Daniela Bauer wrote:
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
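Rob's suggestion of failing stuck (sub)jobs one at a time could be applied to a whole job with a small loop. The following is only a sketch, not part of the original thread: the helper name `fail_stuck_subjobs` and the `STUCK_STATES` tuple are illustrative, and it assumes each subjob exposes `status`, `id` and `force_status()` as in the one-liner Rob gave.

```python
# Sketch only: `fail_stuck_subjobs` and STUCK_STATES are hypothetical names,
# not part of Ganga. Inside a Ganga session the `jobs` registry is provided
# by Ganga itself; this helper only needs a job-like object.
STUCK_STATES = ("submitting", "completing")


def fail_stuck_subjobs(job, stuck_states=STUCK_STATES):
    """Force every subjob of `job` that is in a transient state to 'failed'.

    Returns the ids of the subjobs whose status was changed.
    """
    changed = []
    for sj in job.subjobs:
        if sj.status in stuck_states:
            # Same call as in Rob's message, applied per subjob.
            sj.force_status("failed")
            changed.append(sj.id)
    return changed
```

In a Ganga session this would be called as `fail_stuck_subjobs(jobs(x))`, with `x` being the id of the affected job. Subjobs in any other state are left untouched.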
participants (3)
- Arka Santra
- Daniela Bauer
- Robert Currie