On 19/07/18 11:09, Ham, David A wrote:
Hi Francis,
It seems that forking compilers during MPI runs on your cluster is unreliable. This can be an issue with some MPI setups. You can attempt to get to the bottom of it, but that will be a deep dive into the MPI stack on your cluster, which has little, if anything, to do with Firedrake.
There are two potential fixes to your current problem. One is to modify PyOP2 so that no compiler is executed unless a cache miss actually occurs. That would probably make the warm-cache workaround work for you.
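A minimal sketch of that idea (CACHE_DIR, build_shared_object and the mpicc invocation below are illustrative names, not PyOP2's actual caching code): compile only on a cache miss, so that with a warm cache no compiler process is ever forked from inside the MPI run.

    import hashlib
    import os
    import subprocess

    # Purely illustrative cache layout; PyOP2's real caching code differs.
    CACHE_DIR = os.path.expanduser("~/.cache/jit-kernels")

    def build_shared_object(c_source, cc="mpicc",
                            cflags=("-O3", "-fPIC", "-shared")):
        key = hashlib.sha1(c_source.encode()).hexdigest()
        so_path = os.path.join(CACHE_DIR, key + ".so")
        if os.path.exists(so_path):
            # Warm cache: return immediately; no compiler is forked at all.
            return so_path
        # Cold cache: only now do we fork/exec the compiler.
        os.makedirs(CACHE_DIR, exist_ok=True)
        src_path = os.path.join(CACHE_DIR, key + ".c")
        with open(src_path, "w") as f:
            f.write(c_source)
        subprocess.check_call([cc, *cflags, src_path, "-o", so_path])
        return so_path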
The other, probably more robust, solution would be to add to PyOP2 the ability to call LLVM as a library and thereby avoid the fork call altogether. This is a lot more work. I’m not sure whether we get some of this for free when we move to the loopy backend.
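For illustration only, in-process compilation through the llvmlite bindings looks roughly like this (llvmlite is an assumed choice here, not something PyOP2 currently uses, and the kernel IR is a toy); the point is that nothing is forked or exec'd.

    import ctypes
    import llvmlite.binding as llvm

    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    llvm_ir = r"""
    define double @fpadd(double %a, double %b) {
    entry:
      %res = fadd double %a, %b
      ret double %res
    }
    """

    # Compile the IR inside this process, then fetch a callable pointer.
    module = llvm.parse_assembly(llvm_ir)
    module.verify()
    target_machine = llvm.Target.from_default_triple().create_target_machine()
    engine = llvm.create_mcjit_compiler(module, target_machine)
    engine.finalize_object()

    fpadd = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double,
                             ctypes.c_double)(engine.get_function_address("fpadd"))
    print(fpadd(1.0, 3.5))  # 4.5, compiled entirely in-process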
A third option, which avoids the MPI issue, is to do the following:

1. Before calling MPI_Init, fork() a child process. This process will be able to call fork() later.

2. Now intercept all subprocess calls in the MPI processes and forward them to this child process, which can actually do the forking.

Andreas wrote something to do that here:

https://github.com/inducer/pytools/blob/master/pytools/prefork.py

One has to be a little careful to audit all the places where we might launch a subprocess.

Cheers,

Lawrence
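A minimal sketch of that wiring, assuming mpi4py: enable_prefork and call_capture_output come from the pytools.prefork module linked above, while the mpi4py initialisation handling is an assumption. The important part is the ordering around MPI_Init.

    import mpi4py
    mpi4py.rc.initialize = False     # keep mpi4py from calling MPI_Init on import

    from pytools.prefork import enable_prefork, call_capture_output

    enable_prefork()                 # fork the helper process before MPI_Init

    from mpi4py import MPI
    MPI.Init()

    # Anywhere a compiler would normally be spawned from an MPI process,
    # forward the call to the pre-forked child instead:
    retcode, stdout, stderr = call_capture_output(["cc", "--version"])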