On 19/07/18 11:09, Ham, David A wrote:
Hi Francis,
It seems that forking compilers during MPI runs on your cluster is unreliable. This can be an issue with some MPI setups. You can attempt to get to the bottom of it, but that will be a deep dive into the MPI stack on your cluster, which has little, if anything, to do with Firedrake.
There are two potential fixes to your current problem. One is to modify PyOP2 so that no compiler is executed unless a cache miss actually occurs. That would probably make the warm-cache workaround work for you.
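A minimal sketch of that idea (CACHE_DIR, build_shared_object and the mpicc invocation below are illustrative names, not PyOP2's actual caching code): compile only on a cache miss, so that with a warm cache no compiler process is ever forked from inside the MPI run.

    import hashlib
    import os
    import subprocess

    # Purely illustrative cache layout; PyOP2's real caching code differs.
    CACHE_DIR = os.path.expanduser("~/.cache/jit-kernels")

    def build_shared_object(c_source, cc="mpicc",
                            cflags=("-O3", "-fPIC", "-shared")):
        key = hashlib.sha1(c_source.encode()).hexdigest()
        so_path = os.path.join(CACHE_DIR, key + ".so")
        if os.path.exists(so_path):
            # Warm cache: return immediately; no compiler is forked at all.
            return so_path
        # Cold cache: only now do we fork/exec the compiler.
        os.makedirs(CACHE_DIR, exist_ok=True)
        src_path = os.path.join(CACHE_DIR, key + ".c")
        with open(src_path, "w") as f:
            f.write(c_source)
        subprocess.check_call([cc, *cflags, src_path, "-o", so_path])
        return so_path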
The other, probably more robust, solution would be to add to PyOP2 the ability to call LLVM as a library and thereby avoid the fork call altogether. This is a lot more work. I’m not sure whether we get some of this for free when we move to the loopy backend.
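For illustration only, in-process compilation through the llvmlite bindings looks roughly like this (llvmlite is an assumed choice here, not something PyOP2 currently uses, and the kernel IR is a toy); the point is that nothing is forked or exec'd.

    import ctypes
    import llvmlite.binding as llvm

    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    llvm_ir = r"""
    define double @fpadd(double %a, double %b) {
    entry:
      %res = fadd double %a, %b
      ret double %res
    }
    """

    # Compile the IR inside this process, then fetch a callable pointer.
    module = llvm.parse_assembly(llvm_ir)
    module.verify()
    target_machine = llvm.Target.from_default_triple().create_target_machine()
    engine = llvm.create_mcjit_compiler(module, target_machine)
    engine.finalize_object()

    fpadd = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double,
                             ctypes.c_double)(engine.get_function_address("fpadd"))
    print(fpadd(1.0, 3.5))  # 4.5, compiled entirely in-process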
A third option, which avoids the MPI issue, is to do the following:

1. Before calling MPI_Init, fork() a child process. This process will be able to call fork() later.

2. Now intercept all subprocess calls in the MPI processes and forward them to this child process, which can actually do the forking.

Andreas wrote something to do that here:

https://github.com/inducer/pytools/blob/master/pytools/prefork.py

One has to be a little careful to audit all the places where we might launch a subprocess.

Cheers,

Lawrence
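A minimal sketch of that wiring, assuming mpi4py: enable_prefork and call_capture_output come from the pytools.prefork module linked above, while the mpi4py initialisation handling is an assumption. The important part is the ordering around MPI_Init.

    import mpi4py
    mpi4py.rc.initialize = False     # keep mpi4py from calling MPI_Init on import

    from pytools.prefork import enable_prefork, call_capture_output

    enable_prefork()                 # fork the helper process before MPI_Init

    from mpi4py import MPI
    MPI.Init()

    # Anywhere a compiler would normally be spawned from an MPI process,
    # forward the call to the pre-forked child instead:
    retcode, stdout, stderr = call_capture_output(["cc", "--version"])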