Re: [firedrake] Crash when running at higher order on ARCHER: resolved now
Hi Eike,
Sorry for this. I thought I had fixed that sort of bug, but probably this is triggered by something else. Actually, it should not even be the wrapper's responsibility to construct arrays of that size; everything should happen in COFFEE. I'm planning to do this in the next few days.
In the meantime, could you file a bug under "issues" on github/coffee? Could you paste the form or, even better, a minimal test that leads to this error?
Thank you
-- Fabio
2015-02-14 10:53 GMT+00:00 Eike Mueller <e.mueller@bath.ac.uk>:
Dear firedrakers,
I finally got to the bottom of this. It turns out that I had set parameters["COFFEE"]["O2"] = False around the parloop which executes the kernel, but not around the bit of code which actually compiles the UFL form. Stupid mistake… This caused a horrible segfault, since the kernel expected data of size A[8][20] but was passed A[6][18]. This kind of bug is quite tricky to find, since it only segfaults without much information. I finally managed to run interactively on ARCHER, inspected the core dump with gdb and looked at the generated C code. I was wondering whether this kind of issue can be detected when you generate the wrapper code? Don’t you know both the signature of the function and the passed data at this point?
Or has the COFFEE optimisation issue been resolved? I pulled the latest version of COFFEE, though.
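The mistake described above (changing an option around the parloop but not around form compilation) can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in: the `parameters` dictionary, the `option` context manager and the two helper functions are not the real Firedrake or PyOP2 objects, and the shapes simply mimic the ones reported in this message.

```python
from contextlib import contextmanager

# Hypothetical stand-in for a global options dictionary (not the real one).
parameters = {"COFFEE": {"O2": True}}


@contextmanager
def option(section, key, value):
    """Temporarily set parameters[section][key], restoring it on exit."""
    old = parameters[section][key]
    parameters[section][key] = value
    try:
        yield
    finally:
        parameters[section][key] = old


def compile_form():
    """Hypothetical: shape the generated kernel expects for its array A."""
    return (8, 20) if parameters["COFFEE"]["O2"] else (6, 18)


def allocate_for_parloop():
    """Hypothetical: shape of the data actually passed to the kernel."""
    return (8, 20) if parameters["COFFEE"]["O2"] else (6, 18)


def run(consistent):
    """Return (kernel_shape, data_shape) for the two usage patterns."""
    if consistent:
        # Both phases see the same option value.
        with option("COFFEE", "O2", False):
            kernel_shape = compile_form()
            data_shape = allocate_for_parloop()
    else:
        # The form is compiled outside the scope: the shapes disagree.
        kernel_shape = compile_form()
        with option("COFFEE", "O2", False):
            data_shape = allocate_for_parloop()
    return kernel_shape, data_shape
```

Wrapping both phases in the same scope keeps the two shapes in agreement, and the `finally` clause restores the previous value even if compilation raises.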
Thanks,
Eike
--
Dr Eike Hermann Mueller Research Associate (PostDoc)
Department of Mathematical Sciences University of Bath Bath BA2 7AY, United Kingdom
+44 1225 38 5803 e.mueller@bath.ac.uk http://people.bath.ac.uk/em459/
On 5 Feb 2015, at 16:28, Eike Mueller <E.Mueller@bath.ac.uk> wrote:
Thanks, I tried ATP and also inspected the core dump with
gdb python core
There is no backtrace in the core dump, and ATP does not generate any information either.
I still only get the segfault in my output file. I hope I can localise this a bit more tomorrow.
Eike
On 05/02/15 15:23, Patrick Farrell wrote:
On 05/02/15 14:37, Lawrence Mitchell wrote:
Number of cells on finest grid = 5120
dx = 364.458 km, dt = 2429.717 s
_pmiu_daemon(SIGCHLD): [NID 01160] [c6-0c0s2n0] [Thu Feb 5 14:22:05 2015] PE RANK 11 exit signal Segmentation fault
[NID 01160] 2015-02-05 14:22:05 Apid 12880356: initiated application termination
Application 12880356 exit codes: 139
Application 12880356 resources: utime ~31s, stime ~19s, Rss ~318352, inblocks ~104428, outblocks ~788
Finished atThu Feb 5 14:22:10 GMT 2015
Hmm, that's not a lot of useful information.
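The exit code 139 in the log above is consistent with the common shell convention of reporting 128 plus the signal number, which gives signal 11 (SIGSEGV on Linux) and matches the "Segmentation fault" message. A minimal Python sketch of decoding such codes, using a hypothetical helper name `describe_exit_code` and assuming that convention applies:

```python
import signal


def describe_exit_code(code):
    """Decode an exit code, assuming the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exit status {code}"
```

This only names the signal; it says nothing about where the fault occurred, which is why a core dump or the generated code still has to be inspected.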
Try running again with
module load atp
export ATP_ENABLED=1
Sometimes it gives useful information about abnormal terminations; http://www.archer.ac.uk/documentation/best-practice-guide/debug.php
Patrick
_______________________________________________ firedrake mailing list firedrake@imperial.ac.uk https://mailman.ic.ac.uk/mailman/listinfo/firedrake
Hi Fabio,
no worries. I just submitted an issue on the COFFEE github page, including a minimal example which crashes with the segfault. It only happened on ARCHER, and it goes away if I set PYOP2_DEBUG=1, but this might just be a coincidence and I suspect the problem is machine-independent.
Thanks,
Eike
On 14 Feb 2015, at 12:10, Fabio Luporini <f.luporini12@imperial.ac.uk> wrote:
Hi Eike,
Sorry for this. I thought I had fixed that sort of bug, but probably this is triggered by something else. Actually, it should not even be the wrapper's responsibility to construct arrays of that size; everything should happen in COFFEE. I'm planning to do this in the next few days.
In the meantime, could you file a bug under "issues" on github/coffee? Could you paste the form or, even better, a minimal test that leads to this error?
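The size mismatch reported in this thread (a kernel expecting A[8][20] but receiving A[6][18]) is the kind of error an explicit shape check could turn into a readable exception before any native code runs. The following is only a sketch in plain Python: `check_kernel_arguments` and its dictionary arguments are hypothetical and do not correspond to any existing PyOP2 or COFFEE API.

```python
def check_kernel_arguments(expected_shapes, actual_shapes):
    """Raise ValueError if any argument's shape differs from what the kernel expects.

    Both arguments are hypothetical dictionaries mapping an argument name
    to a tuple of array extents, e.g. {"A": (8, 20)}.
    """
    mismatches = []
    for name, expected in expected_shapes.items():
        actual = actual_shapes.get(name)
        if actual is None:
            mismatches.append(f"{name}: expected {tuple(expected)}, got nothing")
        elif tuple(actual) != tuple(expected):
            mismatches.append(
                f"{name}: expected {tuple(expected)}, got {tuple(actual)}"
            )
    if mismatches:
        raise ValueError("kernel argument mismatch: " + "; ".join(mismatches))
```

For example, calling `check_kernel_arguments({"A": (8, 20)}, {"A": (6, 18)})` raises a `ValueError` naming the argument `A`, rather than leaving the mismatch to surface as a segfault.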
Thank you
-- Fabio
participants (2)
- Eike Mueller
- Fabio Luporini