Crash when running at higher order on ARCHER
Dear firedrakers,

I still get a crash when running at higher order on ARCHER (but it runs fine at lowest order). The output is below. This is odd, since the problem appears to be in the mesh generation, which should be the same for lowest and higher order? This time I chose the problem size small enough (problems of this size run on my laptop, and I use a full node on ARCHER). I use the knepley/fix-plex-1d-refinement branch of PETSc and the firedrake branch of petsc4py. On my laptop (where it works) I use the dmplex-1d-refinement branch of PETSc.

Thanks,

Eike

eike@eslogin002 $ cat helmholtz_2692059.sdb/output.log
Running helmholtz
PBS_JOBID = 2692059.sdb
Started at Thu Feb 5 08:39:45 GMT 2015
+---------------------------+
! Mixed Gravity wave solver !
+---------------------------+
Running on 24 MPI processes
*** Parameters ***
General:
    solve_matrixfree = True
    solve_petsc = True
    warmup_run = True
    orography = False
    nu_cfl = 10.0
    speed_N = 0.01
    speed_c = 300.0
Output:
    savetodisk = False
    output_dir = output
Grid:
    nlayer = 64
    ref_count_coarse = 0
    nlevel = 3
    thickness = 10000.0
Mixed system:
    higher_order = True
    verbose = 2
    schur_diagonal_only = False
    maxiter = 20
    tolerance = 1e-05
    ksp_type = gmres
Pressure solve:
    maxiter = 3
    tolerance = 1e-14
    verbose = 1
    ksp_type = cg
Multigrid:
    mu_relax = 0.8
    n_postsmooth = 1
    n_coarsesmooth = 1
    n_presmooth = 1

The following traceback is then printed (interleaved) by four of the ranks:

Traceback (most recent call last):
  File "/work/n02/n02/eike//git_workspace/firedrake-helmholtzsolver/source/driver.py", line 486, in <module>
    main(parameter_filename)
  File "/work/n02/n02/eike//git_workspace/firedrake-helmholtzsolver/source/driver.py", line 256, in main
    param_grid['thickness'])
  File "/work/n02/n02/eike//git_workspace/firedrake-helmholtzsolver/source/driver.py", line 146, in build_mesh_hierarchy
    refinement_level=ref_count_coarse)
  File "/work/n02/n02/eike/git_workspace/firedrake/firedrake/utility_meshes.py", line 560, in IcosahedralSphereMesh
    m = mesh.Mesh(plex, dim=3, reorder=reorder)
  File "<string>", line 2, in Mesh
  File "/work/n02/n02/eike/git_workspace/PyOP2/pyop2/profiling.py", line 197, in wrapper
    return f(*args, **kwargs)
  File "/work/n02/n02/eike/git_workspace/firedrake/firedrake/mesh.py", line 210, in Mesh
    cell_facets = plex.getConeSize(cStart)
  File "PETSc/DMPlex.pyx", line 140, in petsc4py.PETSc.DMPlex.getConeSize (src/petsc4py.PETSc.c:203913)
AssertionError

Rank 13 [Thu Feb 5 08:40:23 2015] [c7-0c2s11n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 13
Rank 19 [Thu Feb 5 08:40:23 2015] [c7-0c2s11n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 19
Rank 1 [Thu Feb 5 08:40:23 2015] [c7-0c2s11n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Rank 7 [Thu Feb 5 08:40:23 2015] [c7-0c2s11n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
_pmiu_daemon(SIGCHLD): [NID 01517] [c7-0c2s11n1] [Thu Feb 5 08:40:23 2015] PE RANK 1 exit signal Aborted
[NID 01517] 2015-02-05 08:40:23 Apid 12877131: initiated application termination
Application 12877131 exit codes: 134
Application 12877131 resources: utime ~1s, stime ~15s, Rss ~59684, inblocks ~104518, outblocks ~790
Finished at Thu Feb 5 08:40:25 GMT 2015
Hello Eike,

It fails exactly before the Mesh factory function would identify whether to create a simplex or quadrilateral mesh object. It tries to retrieve how many facets a cell has, which fails the assertion pStart <= cStart < pEnd. The only explanation I have is that some MPI processes own 0 (zero) cells of the mesh.

Maybe try using fewer MPI processes or a higher refinement level (more cells).

Regards,
Miklos
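A minimal sketch of the condition described here, assuming the standard petsc4py DMPlex API (the helper name is made up for illustration, it is not Firedrake API): a rank that owns no cells has an empty height-0 stratum, so the plex.getConeSize(cStart) call in mesh.py trips the pStart <= cStart < pEnd check on that rank.

    def count_cell_less_ranks(plex):
        """Count MPI ranks that own no cells of this DMPlex (illustrative only)."""
        cStart, cEnd = plex.getHeightStratum(0)   # cells are the height-0 points
        owns_cells = 1 if cEnd > cStart else 0
        # Sum the "I own no cells" flags over all ranks.
        return plex.comm.tompi4py().allreduce(1 - owns_cells)

Calling something like this on the coarse plex before any further mesh setup would flag the situation directly.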
On 05/02/15 09:44, Homolya, Miklós wrote:
Hello Eike,
It fails exactly before the Mesh factory function would identify whether to create a simplex or quadrilateral mesh object. It tries to retrieve how many facets a cell has, which fails the assertion pStart <= cStart < pEnd. The only explanation I have is that some MPI processes own 0 (zero) cells of the mesh.
Good crystal-balling:
Maybe try using fewer MPI processes or a higher refinement level (more cells). ...
Running on 24 MPI processes
...
ref_count_coarse = 0
This coarse mesh has 20 cells, so four processes (at least) will not own any cells on the coarse mesh. While most of firedrake and pyop2 will work in this situation, various bits of mesh setup won't. Note that even if the coarsest mesh went through, when you refine 4 of the processes won't have any work.

Lawrence
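A back-of-the-envelope check of this point, assuming only that an icosahedral sphere mesh starts from the 20 triangles of the icosahedron and that each refinement splits every triangle into four:

    def icosahedral_cells(refinement_level):
        # 20 triangles on the icosahedron; each refinement quadruples the count.
        return 20 * 4 ** refinement_level

    nprocs = 24
    for level in range(4):
        cells = icosahedral_cells(level)
        print(level, cells, "ok" if cells >= nprocs else "some ranks own no cells")
    # level 0 -> 20 cells, fewer than the 24 ranks: exactly the failing case here.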
Hi Lawrence and Miklos, thanks for your replies.
This coarse mesh has 20 cells, so four processes (at least) will not own any cells on the coarse mesh. While most of firedrake and pyop2 will work in this situation, various bits of mesh setup won't. Note that even if the coarsest mesh went through, when you refine 4 of the processes won't have any work.
oh, yes, that's true. It also means that this will limit the strong scaling I can do with my code if I always want at least one cell per processor.

I just repeated a run with 20*4 cells on the coarsest level on 24 cores. I use 64 layers and refine the coarse grid 3 times, giving 5120 columns on the finest grid, i.e. 5120*64 = 327680 ≈ 3.3E5 cells in total, so that's far away from the 2E9 where we run out of integers for global indices. It crashes with a segfault, see below.

Eike

Number of cells on finest grid = 5120
dx = 364.458 km, dt = 2429.717 s
_pmiu_daemon(SIGCHLD): [NID 01160] [c6-0c0s2n0] [Thu Feb 5 14:22:05 2015] PE RANK 11 exit signal Segmentation fault
[NID 01160] 2015-02-05 14:22:05 Apid 12880356: initiated application termination
Application 12880356 exit codes: 139
Application 12880356 resources: utime ~31s, stime ~19s, Rss ~318352, inblocks ~104428, outblocks ~788
Finished at Thu Feb 5 14:22:10 GMT 2015
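Written out, the cell-count arithmetic from the message above (the 2E9 figure is read here as the usual 32-bit signed integer limit, 2**31 - 1):

    coarse_cells = 20 * 4          # one refinement of the icosahedron -> 80 cells, >= 24 ranks
    columns = coarse_cells * 4**3  # three further refinements -> 5120 columns
    cells_3d = columns * 64        # 64 layers -> 327680 cells on the finest grid
    assert cells_3d < 2**31 - 1    # comfortably below the global index limit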
On 05/02/15 14:30, Eike Mueller wrote:
Hi Lawrence and Miklos,
thanks for your replies.
This coarse mesh has 20 cells, so four processes (at least) will not own any cells on the coarse mesh. While most of firedrake and pyop2 will work in this situation, various bits of mesh setup won't. Note that even if the coarsest mesh went through, when you refine 4 of the processes won't have any work.
oh, yes, that's true. It also means that this will limit the strong scaling I can do with my code if I always want at least one cell per processor.
Yes, we haven't thought hard (or really at all) about how to run on subsets of processes.
I just repeated a run with 20*4 cells on the coarsest level on 24 cores. I use 64 layers and refine the coarse grid 3 times, giving 5120 columns on the finest grid, i.e. 5120*64 = 327680 ≈ 3.3E5 cells in total, so that's far away from the 2E9 where we run out of integers for global indices.
It crashes with a segfault, see below.
Eike
Number of cells on finest grid = 5120
dx = 364.458 km, dt = 2429.717 s
_pmiu_daemon(SIGCHLD): [NID 01160] [c6-0c0s2n0] [Thu Feb 5 14:22:05 2015] PE RANK 11 exit signal Segmentation fault
[NID 01160] 2015-02-05 14:22:05 Apid 12880356: initiated application termination
Application 12880356 exit codes: 139
Application 12880356 resources: utime ~31s, stime ~19s, Rss ~318352, inblocks ~104428, outblocks ~788
Finished at Thu Feb 5 14:22:10 GMT 2015
Hmm, that's not a lot of useful information.

Lawrence
On 05/02/15 14:37, Lawrence Mitchell wrote:
Number of cells on finest grid = 5120
dx = 364.458 km, dt = 2429.717 s
_pmiu_daemon(SIGCHLD): [NID 01160] [c6-0c0s2n0] [Thu Feb 5 14:22:05 2015] PE RANK 11 exit signal Segmentation fault
[NID 01160] 2015-02-05 14:22:05 Apid 12880356: initiated application termination
Application 12880356 exit codes: 139
Application 12880356 resources: utime ~31s, stime ~19s, Rss ~318352, inblocks ~104428, outblocks ~788
Finished at Thu Feb 5 14:22:10 GMT 2015
Hmm, that's not a lot of useful information.
Try running again with

module load atp
export ATP_ENABLED=1

Sometimes it gives useful information about abnormal terminations; http://www.archer.ac.uk/documentation/best-practice-guide/debug.php

Patrick
Thanks, I tried ATP and also inspected the core dump with

gdb python core

There is no backtrace in the core dump, and ATP does not generate any information either. I still only get the segfault in my output file. I hope I can localise this a bit more tomorrow.

Eike
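A generic way to get at least a Python-level traceback out of a segfault, independent of ATP (this assumes the faulthandler module is available for the Python build in use; it is only a sketch of an option, not something that was tried in this thread):

    import faulthandler
    import sys

    # Dump the Python stack of every thread to stderr if the process
    # receives SIGSEGV, SIGFPE, SIGABRT or SIGBUS.
    faulthandler.enable(file=sys.stderr, all_threads=True)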
Dear firedrakers,

I finally got to the bottom of this. It turns out that I had set parameters["COFFEE"]["O2"] = False around the parloop which executes the kernel, but not around the bit of code which actually compiles the UFL form. Stupid mistake... This caused a horrible segfault, since the kernel expected data of size A[8][20], but it was passed A[6][18]. It was quite tricky to find this kind of bug, though, since it only segfaults without much information. I finally managed to run interactively on ARCHER, inspected the core dump with gdb and looked at the generated C code.

I was wondering whether this kind of issue can be detected when you generate the wrapper code? Don't you know both the signature of the function and the passed data at this point? Or has the COFFEE optimisation issue been resolved? I pulled the latest version of COFFEE, though.

Thanks,

Eike

--
Dr Eike Hermann Mueller
Research Associate (PostDoc)
Department of Mathematical Sciences
University of Bath
Bath BA2 7AY, United Kingdom
+44 1225 38 5803
e.mueller@bath.ac.uk
http://people.bath.ac.uk/em459/
Hi Eike,

I think, in general, when you want to disable COFFEE, probably the best place to do so is very soon after "from firedrake import *". Or, alternatively, you may merge the caffeine_withdrawal branch into whatever branch you are using, and use that version of Firedrake.

Regards,
Miklos
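A sketch of the placement suggested here (the parameter key is copied from Eike's message and has not been checked against any particular Firedrake version):

    from firedrake import *

    # Disable the COFFEE O2 optimisations once, immediately after the import,
    # so that form compilation and the parloops executing the generated
    # kernels agree on the kernel signature.
    parameters["COFFEE"]["O2"] = False

    # ... build meshes, function spaces, forms and solvers as usual ...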
participants (5)
- Eike Mueller
- Eike Mueller
- Homolya, Miklós
- Lawrence Mitchell
- Patrick Farrell