Dear firedrakers,

I tried to run the Helmholtz multigrid solver on higher core counts on ARCHER. In certain configurations it works, but in others it crashes.

The number of fine grid cells is 20*4^(4+refcount_coarse).
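
To give a feel for the problem sizes, here is a minimal sketch that evaluates the formula above for the two values of refcount_coarse I tried and divides by the core count. The function name fine_cells is just an illustrative helper, not part of the solver, and the even split across cores is an assumption (the real partitioning will not be perfectly balanced).

```python
def fine_cells(refcount_coarse: int) -> int:
    """Number of fine grid cells, 20*4^(4+refcount_coarse), as stated above."""
    return 20 * 4 ** (4 + refcount_coarse)


# Report total cells and a rough cells-per-core figure,
# assuming (hypothetically) a perfectly even distribution.
for refcount_coarse in (4, 5):
    cells = fine_cells(refcount_coarse)
    for cores in (16, 48):
        print(
            f"refcount_coarse={refcount_coarse}, {cores} cores: "
            f"{cells} cells, about {cells // cores} cells per core"
        )
```

So refcount_coarse=4 gives about 1.3 million fine cells and refcount_coarse=5 about 5.2 million.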

In particular, I was not able to run on 48 cores (i.e. 2 full nodes):

1 node, 16 cores/node = 16 cores, refcount_coarse=5:  Runs
2 nodes, 24 cores/node = 48 cores, refcount_coarse=4: hangs without printing an error message
2 nodes, 24 cores/node = 48 cores, refcount_coarse=5: crashes with a PETSc error message
2 nodes, 24 cores/node = 48 cores, refcount_coarse=5: crashes with a different error message

Output logs for all cases are attached, as well as the pbs scripts for running on 48 cores.

I also tried running the Poisson example in firedrake-benchmark in different configurations:

1 node, 16 cores/node = 16 cores => works
2 nodes, 8 cores/node = 16 cores => works
2 nodes, 24 cores/node = 48 cores => crashes with PETSc error

Note that I used the generic ARCHER firedrake installation for the Poisson runs. For the Helmholtz solver runs I used my own versions of firedrake and PyOP2, since I need the multigrid and local_par-loop branches.

This makes me suspect that the problem is not specific to my code. On the other hand, I thought you had run firedrake on relatively large core counts before, hadn't you?

The output of the Poisson benchmarks is also attached.

Any ideas what might be going wrong, and what the best way to debug this is?

Thanks,

Eike