Dear all,

I'm wondering whether this isn't an error in a PETSc call after all, since it complains about a segfault. I provide callbacks to PETSc KSP solvers in my code, and for this I have to set the size of the vectors that the KSP operates on. In the mixed preconditioner I want it to operate on a vector which contains both the pressure and the velocity dofs. To count the total number of dofs I do this (helmholtz.py:60):

    self.ndof_phi = self.V_pressure.dof_dset.size
    self.ndof_u = self.V_velocity.dof_dset.size
    self.ndof = self.ndof_phi + self.ndof_u

and then I set up the operator for the KSP like this (helmholtz.py:69):

    op = PETSc.Mat().create()
    op.setSizes(((self.ndof, None), (self.ndof, None)))

In the preconditioner class I copy the dofs into and out of the vectors encapsulated in the firedrake pressure and velocity functions, and call my matrix-free solver routine in between (helmholtz.py:365):

    with self.phi_tmp.dat.vec as v:
        v.array[:] = x.array[:self.ndof_phi]
    with self.u_tmp.dat.vec as v:
        v.array[:] = x.array[self.ndof_phi:]
    self.solve(self.phi_tmp, self.u_tmp,
               self.P_phi_tmp, self.P_u_tmp)
    with self.P_phi_tmp.dat.vec_ro as v:
        y.array[:self.ndof_phi] = v.array[:]
    with self.P_u_tmp.dat.vec_ro as v:
        y.array[self.ndof_phi:] = v.array[:]

Is this the correct way of doing it, in particular the use of self.V_pressure.dof_dset.size? Could it be that this works for smaller grid sizes, but on larger grids one processor crashes with an out-of-bounds access, while it is the master processor that reports the compilation error?
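Perhaps the quickest way to rule out a size mismatch is a runtime check along these lines (a minimal sketch, not what is currently in helmholtz.py; it assumes petsc4py's Vec.getLocalSize() and that dat.vec yields a PETSc Vec, as above):

    def apply(self, pc, x, y):
        # The local size of the vector the KSP hands us must match the
        # number of owned dofs counted above; if it doesn't, the slicing
        # in the copy code silently reads or writes out of bounds.
        assert x.getLocalSize() == self.ndof_phi + self.ndof_u
        # dof_dset.size counts owned entries of the underlying set; if a
        # space carries more than one dof per entry, this might not equal
        # the local length of the PETSc vector, so check each part too.
        with self.phi_tmp.dat.vec as v:
            assert v.getLocalSize() == self.ndof_phi
        with self.u_tmp.dat.vec as v:
            assert v.getLocalSize() == self.ndof_u

Thanks,

Eike

On 04/10/14 14:10, Florian Rathgeber wrote: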
On 04/10/14 13:15, Eike Mueller wrote:
> Dear firedrakers,
> I tried to run the Helmholtz multigrid solver on higher core counts on ARCHER. In certain configurations it works, but in others it crashes.
> The number of fine grid cells is 20*4^(4+refcount_coarse).

These mesh sizes should all be manageable afaict, I have run on larger meshes (up to ~60M cells afair).
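(For concreteness: refcount_coarse=4 gives 20*4^8 = 1,310,720 fine cells and refcount_coarse=5 gives 20*4^9 = 5,242,880, so both are indeed well below 60M.)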
> In particular, I was not able to run on 48 cores (i.e. 2 full nodes):
> 1 node, 16 cores/node = 16 cores, refcount_coarse=5: runs
> 2 nodes, 16 cores/node = 48 cores, refcount_coarse=4: hangs without printing an error message

I presume you mean 24 cores/node? How long did this run for? Was it eventually killed by the scheduler?
> 2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a PETSc error message

This isn't a PETSc error, it's only PETSc reporting the crash. The actual issue is a compile error. Can you check whether /work/n02/n02/eike//pyop2-cache/9b5f79130a429e4044a23dacb78191df.so or .so.tmp exists? (A quick check covering both cache entries follows after the next point.)
> 2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a different error message

This is actually the same error. Does /work/n02/n02/eike//pyop2-cache/c8aac6eec48490d587593472c04a5557.so or .so.tmp exist?
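Something like this would check all four candidates at once (a quick sketch; the paths are taken from the tracebacks above):

    import os

    cache = "/work/n02/n02/eike//pyop2-cache"
    for h in ("9b5f79130a429e4044a23dacb78191df",
              "c8aac6eec48490d587593472c04a5557"):
        for ext in (".so", ".so.tmp"):
            # Report whether the compiled module (or its temporary
            # file from an interrupted compile) is present.
            path = os.path.join(cache, h + ext)
            print("%s: %s" % (path, "exists" if os.path.exists(path) else "missing"))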
> Output logs for all cases are attached, as well as the PBS scripts for running on 48 cores.

The job script appears to be for a 32 core run. It looks legit, though you wouldn't need to load the firedrake module given that you also load the dependency modules and use your own PyOP2/Firedrake.
> I also tried running the poisson example in firedrake-benchmark in different configurations:
> 1 node, 16 cores/node = 16 cores => works
> 2 nodes, 8 cores/node = 16 cores => works
> 2 nodes, 24 cores/node = 48 cores => crashes with PETSc error

This one is baffling: it appears that only ranks 24-47 are aborted. I have absolutely no clue what is going on. Is this reproducible?
> I did use the generic ARCHER firedrake installation in this case, though. In the Helmholtz solver runs I used my own version of firedrake and PyOP2, since I need the multigrid and local_par-loop branches.
> This makes me suspicious that it is not a problem specific to my code, but then I thought that you did run firedrake on relatively large core counts before, didn't you?

I have run on up to 1536 cores for Poisson and Cahn-Hilliard.
> The output of the Poisson benchmarks is also attached.
> Any ideas what might be going wrong and what is the best way of debugging this?

I'm not entirely sure what to suggest about the run that just hangs, other than trying to pinpoint where exactly that happens.
For the compilation failures, I recall I had those sporadically and could never figure out what exactly had gone wrong. Check what is in the cache and also what the compiled code represents: an assembly or an expression?
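For the first part, something along these lines would show the most recently written cache entries and their sizes (a rough sketch; it assumes the cache directory from your logs):

    import glob
    import os

    cache = "/work/n02/n02/eike//pyop2-cache"
    # Sort cache entries by modification time; a zero-size .so or a lone
    # .so.tmp would suggest a compilation that was interrupted part-way.
    entries = sorted(glob.glob(os.path.join(cache, "*")), key=os.path.getmtime)
    for path in entries[-10:]:
        print("%10d %s" % (os.path.getsize(path), path))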
Florian
> Thanks,
> Eike
--
Dr Eike Hermann Mueller
Research Associate (PostDoc)

Department of Mathematical Sciences
University of Bath
Bath BA2 7AY, United Kingdom

+44 1225 38 5803
e.mueller@bath.ac.uk
http://people.bath.ac.uk/em459/