Okay, so on a single compute node (20 cores, 2 sockets) everything works fine, even with the warning (originally my code hung at 20 cores). However, if my sbatch script requests more than one compute node, my program freezes for anything > 20 processes. This also happens when I use MPICH, so now I am not sure whether it's simply an issue with our university's HPC system or whether MPICH suffers from the same problem as OpenMPI.

On Thu, Aug 13, 2015 at 4:36 PM, Lawrence Mitchell <lawrence.mitchell@imperial.ac.uk> wrote:
On 13 Aug 2015, at 19:53, Justin Chang <jychang48@gmail.com> wrote:
Lawrence,
When I compile everything with MPICH-3.1.4 on this machine, I get no complaints whatsoever. It only happens when I use OpenMPI. I don't like the default binding options (or lack thereof) for MPICH and would prefer to use OpenMPI. Could this have something to do with the compilers that I am using, and/or with how I am configuring openmpi and/or python?
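(For concreteness, by binding options I mean the sort of thing below. This is just a sketch assuming OpenMPI 1.8+ and MPICH's Hydra launcher, with script.py standing in for my actual run:

    # OpenMPI: bind each rank to a core, spread ranks across sockets
    mpirun --bind-to core --map-by socket -n 20 python script.py

    # MPICH/Hydra equivalent
    mpiexec -bind-to core -map-by socket -n 20 python script.py

so it's not that MPICH can't do it, I just prefer OpenMPI's defaults.)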
I don't think it's to do with how you're configuring openmpi. It's rather that the InfiniBand support is "known bad" when forking; see this OpenMPI FAQ entry: https://www.open-mpi.org/faq/?category=openfabrics#ofa-fork
Our use of fork falls into the "calling system() or popen()" case, so plausibly you might be able to turn off that warning and continue. However, I recall you saying that your code just hangs when you do this, so maybe that's no good.
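If you do want to experiment, the knobs the FAQ points at are MCA parameters. I haven't tried these on your system, so treat the following as a sketch (with script.py standing in for whatever you actually run):

    # just silence the fork() warning
    mpirun --mca mpi_warn_on_fork 0 -n 40 python script.py

    # ask the openib BTL for fork support explicitly
    mpirun --mca btl_openib_want_fork_support 1 -n 40 python script.py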
I could try this out on another HPC system I have access to (Intel Xeon E5-2670) to see if I can reproduce the problem, but that other machine has a firewall, which makes the installation process even more troublesome...
I think we have InfiniBand-based clusters here, so hopefully we can reproduce at this end. There do appear to be some issues with robustness on these kinds of systems, though, so I'm definitely keen to fix things.
Lawrence
_______________________________________________
firedrake mailing list
firedrake@imperial.ac.uk
https://mailman.ic.ac.uk/mailman/listinfo/firedrake