Problem when using MPI
Dear All,

Has anyone ever had problems with GSMPI when running in parallel? I am having some trouble running Nektar in parallel on SHARCNET (https://www.sharcnet.ca). I have run Nektar in parallel on other clusters before (https://c-cfd.meil.pw.edu.pl/hpc/) without any problems.

Let me describe it:

If I do: IncNavierStokesSolver geom.xml cond.xml everything is fine.

If I do (also through the PBS system): mpirun -n 2 IncNavierStokesSolver geom.xml cond.xml I get a segmentation fault.

If I delay inter-process communication (e.g. by loading pre-partitioned meshes) I can see some output, so I think the problem starts with the first MPI send/receive. That is why I suspect GSMPI, since it wraps the MPI communication (right?).

I have already tried different compilers (Intel and GCC), different Boost distributions, etc. Are you aware of any problems with certain OpenMPI versions, or anything like that?

Best Regards,
Stan Gepner
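(A quick way to separate a broken MPI installation or launcher from a problem in Nektar or gslib is to run a tiny standalone point-to-point test under the same mpirun/PBS setup. The sketch below is purely illustrative and not part of the Nektar codebase; the file name mpi_test.c and the mpicc invocation are assumptions about the cluster's toolchain.)

    /* Hypothetical standalone check: does basic point-to-point MPI
     * communication work on the cluster, independent of Nektar/gslib?
     * Compile: mpicc mpi_test.c -o mpi_test
     * Run:     mpirun -n 2 ./mpi_test
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, recv = -1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank sends its rank number to the next rank in a ring
         * and receives from the previous one. */
        int dest = (rank + 1) % size;
        int src  = (rank - 1 + size) % size;

        MPI_Sendrecv(&rank, 1, MPI_INT, dest, 0,
                     &recv, 1, MPI_INT, src,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("Rank %d of %d received %d from rank %d\n",
               rank, size, recv, src);

        MPI_Finalize();
        return 0;
    }

(If this also fails under mpirun -n 2 on the same nodes, the issue is likely in the MPI stack or launcher rather than in GSMPI.)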
Hi Stan,

Hmm, that is indeed strange. Your summary is correct: most of our communication is wrapped by gslib.

What version of the code are you running? We recently made some changes to deal with a couple of MPI initialisation bugs on OS X, but I think those were specific to the PETSc routines. They may nevertheless be affecting you for some reason.

If you are able to run interactively with X-forwarding, perhaps we can have you run a short debugging session to see where the code is segfaulting?

Cheers,
Dave
--
David Moxey (Research Associate)
d.moxey@imperial.ac.uk | www.imperial.ac.uk/people/d.moxey
Room 363, Department of Aeronautics, Imperial College London, London, SW7 2AZ, UK.
Hi Dave,

I am running the version from git (I pull it regularly to see what you are up to ;) ). I do not think this has to do with PETSc; I tried with and without it, with the same result. But, as you say, those changes might be affecting this somehow.

I have access to development nodes, which are basically the same as the cluster nodes, and I get the same segmentation fault there, so the debugging can be done there. Is there any specific way you would like it done (GDB?)? Honestly, with MPI I normally use good old printf() to find the problem, but I am open to new ideas.

Regards,
Stan
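(One common way to use GDB on an MPI job, sketched below, is to have each rank print its PID and hostname and then spin until a debugger is attached and clears a flag. This is a generic technique, not code from the Nektar source; the helper name wait_for_debugger is illustrative, and it assumes GDB is available on the development nodes.)

    /* Hedged sketch: call this near the start of main() in a debug
     * build, then attach GDB to the rank you want to inspect. */
    #include <stdio.h>
    #include <unistd.h>

    void wait_for_debugger(void)
    {
        volatile int attached = 0;
        char hostname[256];

        gethostname(hostname, sizeof(hostname));
        printf("PID %d on %s waiting for debugger attach\n",
               (int)getpid(), hostname);
        fflush(stdout);

        /* From another terminal on that node:
         *   gdb -p <PID>
         *   (gdb) set var attached = 1
         *   (gdb) continue
         * then run until the segfault and inspect the backtrace (bt). */
        while (attached == 0)
            sleep(5);
    }

(Since X-forwarding was mentioned, an alternative that needs no code changes is to launch each rank under its own debugger window, e.g. mpirun -n 2 xterm -e gdb --args IncNavierStokesSolver geom.xml cond.xml.)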