Hello,

The relevant message seems to be this:

    A process has executed an operation involving a call to the "fork()" system call to create a child process. Open MPI is currently operating in a condition that could result in memory corruption or other system errors; your job may hang, crash, or produce silent data corruption. The use of fork() (or system() or other calls that create child processes) is strongly discouraged.

Firedrake compiles the high-level specification of your PDE into efficient binary code dynamically, just in time. It first generates C code, then calls the host C compiler to compile it. This step fails because your MPI does not support the fork() system call.

The suggested workaround is to warm up your disk cache by first solving a small problem serially (or on a single node, if that works), and then launch your parallel runs with the warm cache, so Firedrake won't need to do any further compilation.

Regards,
Miklos

________________________________
From: firedrake-bounces@imperial.ac.uk <firedrake-bounces@imperial.ac.uk> on behalf of Aaron Matthew Baier-Reinio <ambaierreinio@edu.uwaterloo.ca>
Sent: 13 July 2018 19:15:51
To: firedrake
Subject: [firedrake] Errors running Firedrake on multiple nodes

Hello Firedrake team,

I am trying to install Firedrake on a cluster and use MPI to run my Firedrake scripts on multiple nodes. I am running into some issues and was wondering if you would be able to help.

The operating system on the cluster is CentOS 7.4. Both Python 3.5 and PETSc 3.9.0 are already loaded on the cluster via the environment modules tool, so I am running the install script as follows:

    python3 firedrake-install --disable-ssh --no-package-manager --honour-petsc-dir --honour-pythonpath

The install seems to go through successfully, and the resulting installation works when running on a single node, but I am getting errors when I run on multiple nodes.
If I run my script as

    mpirun -np 8 python3 script.py

and all 8 CPUs are located on the same node, then the script seems to run successfully. However, if the CPUs are distributed across multiple nodes, then I always get error messages. The general pattern is that I first get a PyOP2 caching error that looks like this:

    Traceback (most recent call last):
      File "/home/ambaierr/firedrake/src/PyOP2/pyop2/caching.py", line 197, in __new__
        return cls._cache_lookup(key)
      File "/home/ambaierr/firedrake/src/PyOP2/pyop2/caching.py", line 205, in _cache_lookup
        return cls._cache[key]
    KeyError: ('9d856de341689e4f60b3a9709f8cabd1', False, False, False, <class 'pyop2.sequential.Arg'>, (3,), dtype('float64'), 20, (<class 'pyop2.base.IterationIndex'>, 0), None, Access('WRITE'), <class 'pyop2.sequential.Arg'>, (3,), dtype('float64'), 4, (<class 'pyop2.base.IterationIndex'>, 0), None, Access('READ'), <class 'pyop2.sequential.Arg'>, (1,), dtype('float64'), Access('READ'))

This error is then followed by a series of "During handling of the above exception, another exception occurred" errors. I have attached the full output from one of these error messages.

Do you have any idea what could be going wrong here? Please let me know if you need more information.

Thanks,
Aaron
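The warm-cache workaround described in the reply above can be sketched as follows. This is only a sketch: it assumes script.py is the Firedrake script from this thread, that a serial run exercises the same kernels as the parallel run, and that your PyOP2 honours the PYOP2_CACHE_DIR environment variable (check your installation's configuration before relying on it).

```shell
# Sketch of the warm-cache workaround (assumptions noted above).

# Optionally point the kernel cache at a filesystem shared by all nodes,
# so the parallel run can reuse what the warm-up run compiled.
# (PYOP2_CACHE_DIR is an assumption; verify against your PyOP2 config.)
export PYOP2_CACHE_DIR="$HOME/.cache/pyop2"

# 1. Warm the cache serially: without MPI, Firedrake may safely fork()
#    the host C compiler to build and cache its generated kernels.
python3 script.py

# 2. Run in parallel with the warm cache: no further compilation should
#    be needed, so no fork() happens under MPI.
mpirun -np 8 python3 script.py
```

If the parallel run still triggers compilation, the likely cause is that the cache directory is local to each node rather than on a shared filesystem.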