Thank you Lawrence and David for your help with this.


I will contact the people who maintain the cluster and ask about the MPI setup, though I don't imagine that will be very productive unless there are other cases where MPI seems to break down.  But I will try.


As for the three possibilities that have been raised, which would be the easiest/fastest?  The code that Andreas wrote already exists, which suggests it might be the easiest option, but I can't say that I know exactly what to do with it.


Which do you suggest we investigate?


Cheers, Francis



------------------
Francis Poulin  
Associate Dean, Undergraduate Studies                   
Professor
Department of Applied Mathematics
University of Waterloo

email:           fpoulin@uwaterloo.ca
Web:            https://uwaterloo.ca/poulin-research-group/
Telephone:  +1 519 888 4567 x32637


From: firedrake-bounces@imperial.ac.uk <firedrake-bounces@imperial.ac.uk> on behalf of Lawrence Mitchell <lawrence.mitchell@imperial.ac.uk>
Sent: Thursday, July 19, 2018 6:18:16 AM
To: firedrake@imperial.ac.uk
Subject: Re: [firedrake] Errors running Firedrake on multiple nodes
 
On 19/07/18 11:09, Ham, David A wrote:
> Hi Francis,
>
>  
>
> It seems to be the case that forking compilers during MPI runs on your
> cluster is unreliable. This can be an issue with some MPI setups. You
> can attempt to get to the bottom of it, but that would be a deep dive
> into the MPI systems on your cluster which has little, if anything, to
> do with Firedrake.
>
>  
>
> There are two potential fixes to your current problems. One is to
> modify PyOP2 so that it only executes a compiler when a cache miss
> actually occurs. That would probably make the warm-cache workaround
> work for you.
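>
> Roughly, the idea is something like the following untested sketch (the
> names here are illustrative, not the actual PyOP2 API):
>
>     import ctypes, os
>
>     def load_kernel(cache_dir, cache_key, c_source):
>         so_path = os.path.join(cache_dir, cache_key + ".so")
>         if not os.path.exists(so_path):
>             # Only on a genuine cache miss do we fork a compiler.
>             compile_with_system_compiler(c_source, so_path)  # hypothetical helper
>         # Warm cache: just dlopen the existing shared object, no fork needed.
>         return ctypes.CDLL(so_path)
>
> so that ranks with a warm cache never touch the compiler at all.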
>
>  
>
> The other, probably more robust, solution would be to add to PyOP2 the
> ability to call LLVM as a library and therefore avoid the fork call
> entirely. This is a lot more work. I'm not sure whether we get some of
> this for free when we move to the loopy backend.
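>
> To illustrate the "compiler as a library" idea, here is a rough
> llvmlite sketch that JIT-compiles LLVM IR entirely in-process, with no
> fork(). PyOP2 emits C rather than LLVM IR, so treat this only as a
> sketch of the approach, not a drop-in solution:
>
>     import ctypes
>     import llvmlite.binding as llvm
>
>     llvm.initialize()
>     llvm.initialize_native_target()
>     llvm.initialize_native_asmprinter()
>
>     # A trivial kernel in LLVM IR; a real code generator would emit this.
>     ir = """
>     define double @add(double %a, double %b) {
>     entry:
>       %s = fadd double %a, %b
>       ret double %s
>     }
>     """
>
>     # JIT-compile the module in-process: no external compiler, no fork().
>     target_machine = llvm.Target.from_default_triple().create_target_machine()
>     engine = llvm.create_mcjit_compiler(llvm.parse_assembly(""), target_machine)
>     mod = llvm.parse_assembly(ir)
>     mod.verify()
>     engine.add_module(mod)
>     engine.finalize_object()
>
>     # Wrap the compiled function with ctypes and call it directly.
>     addr = engine.get_function_address("add")
>     add = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double, ctypes.c_double)(addr)
>     assert add(1.0, 2.0) == 3.0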

A third option, which avoids the MPI issue, is to do the following:

1. Before calling MPI_Init, fork() a child process.  This process will
be able to call fork() later.

2. Now intercept all subprocess calls in MPI processes and forward
them to this child process, which can safely fork.

Andreas wrote something to do that here:
https://github.com/inducer/pytools/blob/master/pytools/prefork.py
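
A minimal sketch of how that could be wired up (untested; the compiler
command line is just a placeholder):

    # Must run before MPI_Init, i.e. before importing mpi4py.
    from pytools.prefork import enable_prefork, call_capture_output
    enable_prefork()               # forks the helper child process now

    from mpi4py import MPI         # MPI_Init happens here, after the fork

    # Later, on any rank: the compile is forwarded to the pre-MPI child
    # instead of fork()ing from within an MPI process.
    retcode, stdout, stderr = call_capture_output(
        ["cc", "-O2", "-fPIC", "-shared", "-o", "kernel.so", "kernel.c"])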

One has to be careful to audit all the places where we might launch a
subprocess.

Cheers,

Lawrence

_______________________________________________
firedrake mailing list
firedrake@imperial.ac.uk
https://mailman.ic.ac.uk/mailman/listinfo/firedrake