On 06/10/14 10:37, Eike Mueller wrote:
Hi Florian,
thanks for your email and suggestions. Here is some more data:
These mesh sizes should all be manageable afaict; I have run on larger meshes (up to ~60M cells afair).
In particular, I was not able to run on 48 cores (i.e. 2 full nodes):
1 node, 16 cores/node = 16 cores, refcount_coarse=5: runs
2 nodes, 16 cores/node = 48 cores, refcount_coarse=4: hangs without printing an error message

I presume you mean 24 cores/node? How long did this run for? Was it eventually killed by the scheduler?

Yes, sorry, that should be 24 cores/node for refcount_coarse=4,5. However, for refcount_coarse=6 I ran on 16 cores/node, so 32 cores in total.
Can't tell exactly how long the refcount_coarse=4 case ran before it crashed, but the wallclock time limit was set to 5 minutes, and I'm pretty sure it ran for at least a couple of minutes, i.e. it did not abort immediately.
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a PETSc error message

This isn't a PETSc error, it's only PETSc reporting the crash. The actual issue is a compile error. Can you check whether /work/n02/n02/eike//pyop2-cache/9b5f79130a429e4044a23dacb78191df.so or .so.tmp exists?

Ah, good to know it's not a PETSc error, I was worried that I had to rebuild PETSc... But why does PETSc report the crash if it is an issue with the compilation?
2 nodes, 16 cores/node = 48 cores, refcount_coarse=5: crashes with a different error message

This is actually the same error. Does /work/n02/n02/eike//pyop2-cache/c8aac6eec48490d587593472c04a5557.so or .so.tmp exist?
Can't check whether the .so or .so.tmp files exist, but I just repeated a run with refcount_coarse=5 on 48 cores (2 full ARCHER nodes). As before, it crashes. I attach all files generated by this run, as well as the submission scripts, as a .tgz file to this email.
For this I also copied the files 478ec26b47996566adb89c5545f64b5c.{c,err,log} from the pyop2-cache directory; for this run no .so files were generated as far as I can see. The .err file is empty.
I manually compiled the code on a login node by copying the contents of 478ec26b47996566adb89c5545f64b5c.log. This works as it should and produces 478ec26b47996566adb89c5545f64b5c.so.tmp, but only if I have sourced my firedrake.env. Is it possible that this does not get loaded properly in some cases? But then it should also crash for smaller core counts.
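As an aside, here is a minimal sketch of one way to list which artefacts exist for the cache entries mentioned in this thread. The cache path and hash names are copied from the emails above; the script itself is just plain Python and not specific to PyOP2.

# Sketch: list which artefacts exist for the cache entries mentioned in
# this thread (path and hashes taken from the emails above).
import os

cache = "/work/n02/n02/eike//pyop2-cache"
hashes = ["9b5f79130a429e4044a23dacb78191df",
          "c8aac6eec48490d587593472c04a5557",
          "478ec26b47996566adb89c5545f64b5c"]

for h in hashes:
    for ext in (".c", ".log", ".err", ".so", ".so.tmp"):
        path = os.path.join(cache, h + ext)
        if os.path.exists(path):
            print("%s (%d bytes)" % (path, os.path.getsize(path)))
        else:
            print("%s missing" % path)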
As suggested by Lawrence, the best way to debug this is to run with PYOP2_DEBUG=1. Has this revealed anything?
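For reference, a minimal sketch of one way to switch that on from the driver script, under the assumption that PyOP2 picks up the PYOP2_* environment variables when it is first imported; exporting PYOP2_DEBUG=1 in the PBS job script should work just as well.

# Sketch, assuming PyOP2 reads PYOP2_* environment variables at import time.
import os
os.environ["PYOP2_DEBUG"] = "1"

# Import firedrake only after setting the variable.
from firedrake import *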
Output logs for all cases are attached, as well as the PBS scripts for running on 48 cores.

The job script appears to be for a 32-core run. It looks legit, though you wouldn't need to load the firedrake module given that you also load the dependency modules and use your own PyOP2/Firedrake.
Just to clarify, you mean that I can remove
module load firedrake
but should leave
module load fdrake-build-env
module load fdrake-python-env
in? I just removed the "module load firedrake" and it does not make any difference (but I guess you were not implying that it would fix the problem anyway).
Yes, you're right. This was just an FYI, I wasn't expecting that to be the issue.
I also tried running the poisson example in firedrake-benchmark in different configurations:
1 node, 16 cores/node = 16 cores => works
2 nodes, 8 cores/node = 16 cores => works
2 nodes, 24 cores/node = 48 cores => crashes with a PETSc error

This one is baffling: it appears that only ranks 24-47 are aborted. I have absolutely no clue what is going on. Is this reproducible?
I will look at this again.
I did use the generic ARCHER firedrake installation in this case, though. In the Helmholtz solver runs I used my own version of firedrake and PyOP2 since I need the multigrid and local_par-loop branches.
This makes me suspect that it is not a problem specific to my code, but then I thought that you had run Firedrake on relatively large core counts before, didn't you?

I have run on up to 1536 cores for Poisson and Cahn-Hilliard.
The output of the Poisson benchmarks is also attached.
Any ideas what might be going wrong, and what is the best way to debug this?

I'm not entirely sure what to suggest about the run that just hangs, other than trying to pinpoint where exactly that happens.
For the compilation failures, I recall I had those sporadically and could never figure out what exactly had gone wrong. Check what is in the cache and also what the compiled code represents: an assembly or an expression?
For the Helmholtz solver run above it seems to be an assembly (see output.log):
assemble(self.psi*div(self.dMinvMr_u)*dx)
Is either self.psi or self.dMinvMr_u a literal constant that could vary between processes? If so, this will not work, since all processes *must* run the same code.
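To illustrate what I mean, here is a minimal sketch (the names alpha, mesh, V, u and v are made up, not taken from the actual solver code): a plain Python number is inlined into the generated kernel, so if its value differs between ranks every process generates and compiles different code, whereas a Constant keeps the generated code identical and passes the value in as data.

# Minimal sketch with made-up names, not taken from the solver code.
from firedrake import *
from mpi4py import MPI

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)
u = Function(V)
v = TestFunction(V)

# A plain Python float is inlined into the generated kernel.  If its value
# differs between MPI ranks, each rank generates (and tries to compile)
# different code, which breaks the collective compilation.
alpha = 1.0 + MPI.COMM_WORLD.rank
# assemble(alpha * u * v * dx)   # would go wrong when run in parallel

# A Constant keeps the generated code identical on every rank; its value is
# passed in as data and can be updated with .assign() without recompiling.
alpha_c = Constant(1.0)
b = assemble(alpha_c * u * v * dx)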
Florian