firedrake on ARCHER - helmholtz solver for larger core counts
Dear firedrakers,

I tried to run the Helmholtz multigrid solver on higher core counts on ARCHER. In certain configurations it works, but in others it crashes. The number of fine grid cells is 20*4^(4+refcount_coarse). In particular, I was not able to run on 48 cores (i.e. 2 full nodes):

1 node, 16 cores/node = 16 cores, refcount_coarse=5: runs
2 nodes, 16 cores/node = 48 cores, refcount_coarse=4: hangs without printing an error message
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a PETSc error message
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a different error message

Output logs for all cases are attached, as well as the pbs scripts for running on 48 cores.

I also tried running the poisson example in firedrake-benchmark in different configurations:

1 node, 16 cores/node = 16 cores => works
2 nodes, 8 cores/node = 16 cores => works
2 nodes, 24 cores/node = 48 cores => crashes with PETSc error

I did use the generic ARCHER firedrake installation in this case, though. In the Helmholtz solver runs I used my own version of firedrake and PyOP2, since I need the multigrid and local_par-loop branches. This makes me suspect that the problem is not specific to my code, but then again, you have run firedrake on relatively large core counts before, haven't you? The output of the Poisson benchmarks is also attached.

Any ideas what might be going wrong and what is the best way for debugging this?

Thanks,

Eike

--
Dr Eike Hermann Mueller
Research Associate (PostDoc)
Department of Mathematical Sciences
University of Bath
Bath BA2 7AY, United Kingdom
+44 1225 38 5803
e.mueller@bath.ac.uk
http://people.bath.ac.uk/em459/
On 04/10/14 13:15, Eike Mueller wrote:
Dear firedrakers,
I tried to run the Helmholtz multigrid solver on higher core counts on ARCHER. In certain configurations it works, but in others it crashes.
The number of fine grid cells is 20*4^(4+refcount_coarse).
These mesh sizes should all be manageable afaict, I have run on larger meshes (up to ~60M cells afair).
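For reference, the fine-grid cell counts implied by the formula 20*4^(4+refcount_coarse) work out as follows (plain Python arithmetic, nothing Firedrake-specific):

    # Fine-grid cell counts for the formula quoted above
    for refcount_coarse in (4, 5, 6):
        ncells = 20 * 4 ** (4 + refcount_coarse)
        print("refcount_coarse=%d: %10d fine cells" % (refcount_coarse, ncells))
    # refcount_coarse=4:    1310720 fine cells
    # refcount_coarse=5:    5242880 fine cells
    # refcount_coarse=6:   20971520 fine cells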
In particular, I was not able to run on 48 cores (i.e. 2 full nodes)
1 node, 16 cores/node = 16 cores, refcount_coarse=5: runs
2 nodes, 16 cores/node = 48 cores, refcount_coarse=4: hangs without printing an error message
I presume you mean 24 cores/node? How long did this run for? Was it eventually killed by the scheduler?
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a PETSc error message
This isn't a PETSc error, it's only PETSc reporting the crash. The actual issue is a compile error. Can you check whether /work/n02/n02/eike//pyop2-cache/9b5f79130a429e4044a23dacb78191df.so or .so.tmp exists?
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a different error message
This is actually the same error. Does /work/n02/n02/eike//pyop2-cache/c8aac6eec48490d587593472c04a5557.so or .so.tmp exist?
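In case it helps, here is a small (hypothetical) helper to check which artefacts exist for the two cache entries mentioned above; the cache directory and hashes are taken from your error output, everything else is plain os.path:

    import os

    # Hypothetical helper: list which artefacts exist for a given PyOP2 cache entry.
    cache_dir = "/work/n02/n02/eike/pyop2-cache"
    for h in ("9b5f79130a429e4044a23dacb78191df",
              "c8aac6eec48490d587593472c04a5557"):
        for ext in (".c", ".log", ".err", ".so", ".so.tmp"):
            path = os.path.join(cache_dir, h + ext)
            print(path, "exists" if os.path.exists(path) else "missing")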
Output logs for all cases are attached, as well as the pbs scripts for running on 48 cores.
The job script appears to be for a 32 core run. It looks legit, though you wouldn't need to load the firedrake module given that you also load the dependency modules and use your own PyOP2/Firedrake.
I also tried running the poisson example in firedrake-benchmark in different configurations:
1 node, 16 cores/node = 16 cores => works
2 nodes, 8 cores/node = 16 cores => works
2 nodes, 24 cores/node = 48 cores => crashes with PETSc error
This one is baffling: it appears that only ranks 24-47 are aborted. I have absolutely no clue what is going on. Is this reproducible?
I did use the generic ARCHER firedrake installation in this case, though. In the Helmholtz solver runs I used my own version of firedrake and PyOP2 since I need the multigrid and local_par-loop branches.
This makes me suspect that the problem is not specific to my code, but then again, you have run firedrake on relatively large core counts before, haven't you?
I have run on up to 1536 cores for Poisson and Cahn-Hilliard.
The output of the Poisson benchmarks is also attached.
Any ideas what might be going wrong and what is the best way for debugging this?
I'm not entirely sure what to suggest about the run that just hangs other than trying to pinpoint where exactly that happens. For the compilation failures, I recall I had those sporadically and could never figure out what exactly had gone wrong. Check what is in the cache and also what the compiled code represents: an assembly or an expression?

Florian
Thanks,
Eike
Hi Florian, thanks for your email and suggestions. Here is some more data:
These mesh sizes should all be manageable afaict, I have run on larger meshes (up to ~60M cells afair).
In particular, I was not able to run on 48 cores (i.e. 2 full nodes)
1 node, 16 cores/node = 16 cores, refcount_coarse=5: runs
2 nodes, 16 cores/node = 48 cores, refcount_coarse=4: hangs without printing an error message

I presume you mean 24 cores/node? How long did this run for? Was it eventually killed by the scheduler?

Yes, sorry, it should be 24 cores/node for refcount_coarse=4,5. However, for refcount=6 I ran on 16 cores/node, so 32 cores in total.
Can't tell exactly how long the refcount_coarse=4 case ran until it crashed, but the wallclock time limit was set to 5 minutes, and I'm pretty sure it ran at least for a couple of minutes, i.e. did not abort immediately.
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a PETSc error message

This isn't a PETSc error, it's only PETSc reporting the crash. The actual issue is a compile error. Can you check whether /work/n02/n02/eike//pyop2-cache/9b5f79130a429e4044a23dacb78191df.so or .so.tmp exists?

Ah, good to know it's not a PETSc error, I was worried that I had to rebuild PETSc... But why does PETSc report the crash if it is an issue with the compilation?
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a different error message

This is actually the same error. Does /work/n02/n02/eike//pyop2-cache/c8aac6eec48490d587593472c04a5557.so or .so.tmp exist?
Can't check if the .so, .so.tmp files exist, but I just repeated a run with refcount_coarse=5 on 48 cores (2 full ARCHER nodes). As before, it crashes. I attach all files generated by this run, as well as the submission scripts, as a .tgz file to this email.

For this I also copied the files 478ec26b47996566adb89c5545f64b5c.{c,err,log} from the pyop2-cache directory; for this run no .so files were generated as far as I can see. The .err file is empty.

I manually compiled the code on a login node by copying the contents of 478ec26b47996566adb89c5545f64b5c.log. This works as it should and produces 478ec26b47996566adb89c5545f64b5c.so.tmp, but only if I have sourced my firedrake.env. Is it possible that this does not get loaded properly in some cases? But then it should also crash for smaller core counts.
Output logs for all cases are attached, as well as the pbs scripts for running on 48 cores.

The job script appears to be for a 32 core run. It looks legit, though you wouldn't need to load the firedrake module given that you also load the dependency modules and use your own PyOP2/Firedrake.
Just to clarify, you mean that I can remove

module load firedrake

but should leave

module load fdrake-build-env
module load fdrake-python-env

in? I just removed the "module load firedrake" and it does not make any difference (but I guess you were not implying that it would fix the problem anyway).
I also tried running the poisson example in firedrake-benchmark in different configurations:
1 node, 16 cores/node = 16 cores => works
2 nodes, 8 cores/node = 16 cores => works
2 nodes, 24 cores/node = 48 cores => crashes with PETSc error

This one is baffling: it appears that only ranks 24-47 are aborted. I have absolutely no clue what is going on. Is this reproducible?
I will look at this again.
I did use the generic ARCHER firedrake installation in this case, though. In the Helmholtz solver runs I used my own version of firedrake and PyOP2 since I need the multigrid and local_par-loop branches.
This makes me suspect that the problem is not specific to my code, but then again, you have run firedrake on relatively large core counts before, haven't you?

I have run on up to 1536 cores for Poisson and Cahn-Hilliard.
The output of the Poisson benchmarks is also attached.
Any ideas what might be going wrong and what is the best way for debugging this?

I'm not entirely sure what to suggest about the run that just hangs other than trying to pinpoint where exactly that happens.
For the compilation failures, I recall I had those sporadically and could never figure out what exactly had gone wrong. Check what is in the cache and also what the compiled code represents: an assembly or an expression?
For the Helmholtz solver run above it seems to be an assembly (see output.log):

    assemble(self.psi*div(self.dMinvMr_u)*dx)

Cheers,
Eike
Florian
Thanks,
Eike
On 06/10/14 10:37, Eike Mueller wrote:
Hi Florian,
thanks for your email and suggestions. Here is some more data:
These mesh sizes should all be manageable afaict, I have run on larger meshes (up to ~60M cells afair).
In particular, I was not able to run on 48 cores (i.e. 2 full nodes)
1 node, 16 cores/node = 16 cores, refcount_coarse=5: runs
2 nodes, 16 cores/node = 48 cores, refcount_coarse=4: hangs without printing an error message

I presume you mean 24 cores/node? How long did this run for? Was it eventually killed by the scheduler?

Yes, sorry, it should be 24 cores/node for refcount_coarse=4,5. However, for refcount=6 I ran on 16 cores/node, so 32 cores in total.
Can't tell exactly how long the refcount_coarse=4 case ran until it crashed, but the wallclock time limit was set to 5 minutes, and I'm pretty sure it ran at least for a couple of minutes, i.e. did not abort immediately.
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a PETSc error message

This isn't a PETSc error, it's only PETSc reporting the crash. The actual issue is a compile error. Can you check whether /work/n02/n02/eike//pyop2-cache/9b5f79130a429e4044a23dacb78191df.so or .so.tmp exists?

Ah, good to know it's not a PETSc error, I was worried that I had to rebuild PETSc... But why does PETSc report the crash if it is an issue with the compilation?
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a different error message

This is actually the same error. Does /work/n02/n02/eike//pyop2-cache/c8aac6eec48490d587593472c04a5557.so or .so.tmp exist?
Can't check if the .so, .so.tmp files exist, but I just repeated a run with refcount_coarse=5 on 48 cores (2 full ARCHER nodes). As before, it crashes. I attach all files generated by this run, as well as the submission scripts as a .tgz file to this email.
For this I also copied the files 478ec26b47996566adb89c5545f64b5c.{c,err,log} from the pyop2-cache directory; for this run no .so files were generated as far as I can see. The .err file is empty.
I manually compiled the code on a login node by copying the contents of 478ec26b47996566adb89c5545f64b5c.log. This works as it should and produces 478ec26b47996566adb89c5545f64b5c.so.tmp, but only if I have sourced my firedrake.env. Is it possible that this does not get loaded properly in some cases? But then it should also crash for smaller core counts.
As suggested by Lawrence the best way to debug this is running with PYOP2_DEBUG=1. Has this revealed anything?
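For reference, a minimal way to switch this on from the driver script, assuming (as I believe is the case) that PyOP2 picks the flag up from the environment when it is first imported; alternatively export PYOP2_DEBUG=1 in the PBS script before launching the job:

    import os

    # Assumption: PyOP2 reads PYOP2_DEBUG from the environment at import time,
    # so it has to be set before the first firedrake/PyOP2 import.
    os.environ["PYOP2_DEBUG"] = "1"

    from firedrake import *  # debug checks are now enabled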
Output logs for all cases are attached, as well as the pbs scripts for running on 48 cores.

The job script appears to be for a 32 core run. It looks legit, though you wouldn't need to load the firedrake module given that you also load the dependency modules and use your own PyOP2/Firedrake.
Just to clarify, you mean that I can remove
module load firedrake
but should leave
module load fdrake-build-env
module load fdrake-python-env
in? I just removed the "module load firedrake" and it does not make any difference (but I guess you were not implying that it would fix the problem anyway).
Yes, you're right. This was just an FYI, I wasn't expecting that to be the issue.
I also tried running the poisson example in firedrake-benchmark in different configurations:
1 node, 16 cores/node = 16 cores => works
2 nodes, 8 cores/node = 16 cores => works
2 nodes, 24 cores/node = 48 cores => crashes with PETSc error

This one is baffling: it appears that only ranks 24-47 are aborted. I have absolutely no clue what is going on. Is this reproducible?
I will look at this again.
I did use the generic ARCHER firedrake installation in this case, though. In the Helmholtz solver runs I used my own version of firedrake and PyOP2 since I need the multigrid and local_par-loop branches.
This makes me suspect that the problem is not specific to my code, but then again, you have run firedrake on relatively large core counts before, haven't you?

I have run on up to 1536 cores for Poisson and Cahn-Hilliard.
The output of the Poisson benchmarks is also attached.
Any ideas what might be going wrong and what is the best way for debugging this?

I'm not entirely sure what to suggest about the run that just hangs other than trying to pinpoint where exactly that happens.
For the compilation failures, I recall I had those sporadically and could never figure out what exactly had gone wrong. Check what is in the cache and also what the compiled code represents: an assembly or an expression?
For the Helmholtz solver run above it seems to be an assembly (see output.log):
    assemble(self.psi*div(self.dMinvMr_u)*dx)
Is either self.psi or self.dMinvMr_u a literal constant that could vary between processes? If so, this will not work since all processes *must* run the same code.

Florian
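PS: to illustrate what I mean by a literal constant varying between processes, here is a minimal, made-up sketch (not your actual code; the mesh, space and values are placeholders):

    from firedrake import *
    from mpi4py import MPI

    mesh = UnitSquareMesh(8, 8)
    V = FunctionSpace(mesh, "CG", 1)
    v = TestFunction(V)

    # Problematic: a plain Python float that differs between ranks is baked into
    # the generated kernel as a literal, so each rank produces (and tries to
    # compile) different code.
    alpha = 1.0 + 0.1 * MPI.COMM_WORLD.rank
    # b_bad = assemble(alpha * v * dx)   # kernel source differs across ranks

    # Safe: wrap the value in a Constant; the kernel is then identical on all
    # ranks and only the coefficient data differs.
    b = assemble(Constant(alpha) * v * dx)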
Cheers,
Eike
Florian
Thanks,
Eike
Hi Florian,
I also tried running the poisson example in firedrake-benchmark in different configurations:
1 node, 16 cores/node = 16 cores => works
2 nodes, 8 cores/node = 16 cores => works
2 nodes, 24 cores/node = 48 cores => crashes with PETSc error

This one is baffling: it appears that only ranks 24-47 are aborted. I have absolutely no clue what is going on. Is this reproducible?
Just tried this again (twice) and it works now. Maybe there was a transient problem with one node.

Thanks,
Eike
Dear all,

I'm wondering whether this isn't an error in a PETSc call after all, since it complains about a segfault. I provide callbacks to PETSc KSP solvers in my code, and for this I have to set the size of vectors that the KSP operates on. In the mixed preconditioner, I want it to operate on a vector which contains both the pressure and the velocity dofs. To count the total number of dofs I do this (helmholtz.py:60):

    self.ndof_phi = self.V_pressure.dof_dset.size
    self.ndof_u = self.V_velocity.dof_dset.size
    self.ndof = self.ndof_phi + self.ndof_u

and then I set up the operator for the KSP like this (helmholtz.py:69):

    op = PETSc.Mat().create()
    op.setSizes(((self.ndof, None), (self.ndof, None)))

In the preconditioner class I copy the dofs from the vectors encapsulated in the firedrake pressure and velocity functions in and out, and then call my matrix-free solver routine (helmholtz.py:365):

    with self.phi_tmp.dat.vec as v:
        v.array[:] = x.array[:self.ndof_phi]
    with self.u_tmp.dat.vec as v:
        v.array[:] = x.array[self.ndof_phi:]
    self.solve(self.phi_tmp, self.u_tmp,
               self.P_phi_tmp, self.P_u_tmp)
    with self.P_phi_tmp.dat.vec_ro as v:
        y.array[:self.ndof_phi] = v.array[:]
    with self.P_u_tmp.dat.vec_ro as v:
        y.array[self.ndof_phi:] = v.array[:]

Is this the correct way of doing it, in particular the use of self.V_pressure.dof_dset.size? It's conceivable that it doesn't have problems for smaller grid sizes, and then for larger grid sizes it crashes on one processor where there is an out-of-bounds access, but it reports the compilation error on the master processor?

Thanks,
Eike
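A quick (hypothetical) sanity check that could go at the top of the copy-in code at helmholtz.py:365, comparing the dof counts above against the local sizes PETSc actually sees. It assumes dof_dset.size equals the local PETSc vector size (i.e. one degree of freedom per dataset entry); the names follow the snippet above:

    # Hypothetical consistency check for the sizes used above.
    with self.phi_tmp.dat.vec_ro as v:
        assert v.getLocalSize() == self.ndof_phi, \
            "pressure dof count does not match local PETSc vector size"
    with self.u_tmp.dat.vec_ro as v:
        assert v.getLocalSize() == self.ndof_u, \
            "velocity dof count does not match local PETSc vector size"
    assert x.getLocalSize() == self.ndof, \
        "KSP vector local size does not match ndof_phi + ndof_u"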
On 6 Oct 2014, at 14:08, Eike Mueller <e.mueller@bath.ac.uk> wrote:
Dear all,
I'm wondering whether this isn't an error in a PETSc call after all, since it complains about a segfault.
PETSc installs an error handler which reports errors as petsc errors even when the abort has come from elsewhere.
I provide callbacks to PETSc KSP solvers in my code, and for this I have to set the size of vectors that the KSP operates on. In the mixed preconditioner, I want it to operate on a vector which contains both the pressure and the velocity dofs. To count the total number of dofs I do this:
helmholtz.py:60

    self.ndof_phi = self.V_pressure.dof_dset.size
    self.ndof_u = self.V_velocity.dof_dset.size
    self.ndof = self.ndof_phi + self.ndof_u
and then I set up the operator for the KSP like this:
helmholtz.py:69

    op = PETSc.Mat().create()
    op.setSizes(((self.ndof, None), (self.ndof, None)))
In the preconditioner class I copy the dofs from the vectors encapsulated in the firedrake pressure- and velocity functions in and out and then call my matrix-free solver routine:
helmholtz.py:365

    with self.phi_tmp.dat.vec as v:
        v.array[:] = x.array[:self.ndof_phi]
    with self.u_tmp.dat.vec as v:
        v.array[:] = x.array[self.ndof_phi:]
    self.solve(self.phi_tmp, self.u_tmp,
               self.P_phi_tmp, self.P_u_tmp)
    with self.P_phi_tmp.dat.vec_ro as v:
        y.array[:self.ndof_phi] = v.array[:]
    with self.P_u_tmp.dat.vec_ro as v:
        y.array[self.ndof_phi:] = v.array[:]
Is this the correct way of doing it, in particular the use of self.V_pressure.dof_dset.size?
Yes, that's right. If the sizes are wrong you would see a numpy error about mismatching array sizes.
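For anyone following along, here is a stripped-down, self-contained sketch of the pattern being discussed: a matrix-free operator handed to a KSP through a petsc4py Python context. It is not the actual helmholtz.py code; the mesh, function space and the trivial mult body are placeholders:

    from firedrake import *
    from petsc4py import PETSc

    mesh = UnitSquareMesh(16, 16)
    V = FunctionSpace(mesh, "CG", 1)
    ndof = V.dof_dset.size  # process-local owned dofs, as in helmholtz.py

    class MatrixFreeOp(object):
        """Placeholder context: applies the identity. A real implementation
        would copy x into Firedrake Functions, apply the operator and copy
        the result back into y."""
        def mult(self, mat, x, y):
            x.copy(y)

    op = PETSc.Mat().create(comm=PETSc.COMM_WORLD)
    op.setSizes(((ndof, None), (ndof, None)))  # (local, global=None) on each rank
    op.setType(PETSc.Mat.Type.PYTHON)
    op.setPythonContext(MatrixFreeOp())
    op.setUp()

    ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
    ksp.setOperators(op)
    ksp.setType(PETSc.KSP.Type.CG)
    ksp.getPC().setType(PETSc.PC.Type.NONE)
    ksp.setFromOptions()

    x, b = op.createVecs()
    b.set(1.0)
    ksp.solve(b, x)

Passing (ndof, None) specifies the local size on each process and lets PETSc deduce the global size, so getting the local dof count right on every rank is what matters here.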
It's conceivable that it doesn't have problems for smaller grid sizes, and then for larger grid sizes it crashes on one processor where there is an out-of-bounds access, but it reports the compilation error on the master processor?
If you're getting (or were getting) compilation errors, can you please run with "export PYOP2_DEBUG=1"? It's possible that the code is not the same on all processes, which would be caught by the above.

Please note as well that the halo regions are going to be massive. I have some branches that build the correct shrunk halos, but they haven't landed yet and I'm somewhat incapacitated with a broken collarbone. You'd need:

firedrake: multigrid-parallel
pyop2: local-par_loop
petsc4py: bitbucket.org/mapdes/petsc4py branch moar-plex
petsc: mlange/plex-distributed-overlap

Functionality similar to the latter should hopefully arrive in petsc master this week.

Lawrence
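As an aside, this is not PyOP2's actual implementation, but a generic illustration of the kind of cross-rank consistency check such a debug mode can perform, using mpi4py and hashlib:

    import hashlib
    from mpi4py import MPI

    def check_code_identical(code, comm=MPI.COMM_WORLD):
        """Illustrative only: verify that every rank generated the same kernel
        source before attempting to compile it."""
        digest = hashlib.md5(code.encode()).hexdigest()
        digests = comm.allgather(digest)
        if len(set(digests)) != 1:
            raise RuntimeError("Generated code differs between ranks: %s" % digests)

    # usage sketch: check_code_identical(kernel_source)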
Dear all,

I rebuilt 'my' PETSc with debugging enabled, and then also rebuilt petsc4py and firedrake. I could run on 48 cores with 7 levels once, but this was not reproducible. When I tried again it crashed with a PETSc error.

Is there any chance that the branches below help? Do I need mlange/plex-distributed-overlap or can I use the PETSc master now?

Thanks a lot,
Eike
firedrake: multigrid-parallel
pyop2: local-par_loop
petsc4py: bitbucket.org/mapdes/petsc4py branch moar-plex
petsc: mlange/plex-distributed-overlap
Functionality similar to the latter should hopefully arrive in petsc master this week
Lawrence
Dear all,

with Lawrence's fix it now works with 10 levels (20,971,520 cells) on two full nodes. If I increase both the number of cells and the number of nodes by a factor of 4 (i.e. same #dof/core) I get a crash again. However, the position in the code where it crashes varies; here are two examples:

https://gist.github.com/eikehmueller/a07007af72b34a57efff

In the 1st and 3rd run I had cleared the PyOP2/firedrake caches, whereas in the second case I'm not entirely sure what the state of the cache was. In the 3rd run I also set PYOP2_DEBUG=1, whereas it was 0 in the first two runs. The weird thing is also that the 3rd run seems to complete, but still generates the dreaded PETSc segfault.

Thanks,
Eike
participants (3)

- Eike Mueller
- Florian Rathgeber
- Lawrence Mitchell