The workaround suggested above may appear to fail for poorly formulated time-stepping methods. Depending on how time-dependent parameters enter the problem, form compilation can be triggered in every time step. To check whether this is happening, you can increase the log level by inserting
set_log_level(logging.INFO)
after the Firedrake import statement. If you see repeated calls to the form compiler in each time step, that's a sign that your problem formulation should be improved.
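For example (a minimal sketch, not taken from your code): if a time-dependent coefficient enters the form as a plain Python number that changes every step, each step produces a new form and triggers recompilation; holding it in a Constant and updating it with assign() lets the compiled kernels be reused:

from firedrake import *

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)
u, v = TrialFunction(V), TestFunction(V)

t = Constant(0.0)      # time-dependent parameter held in a Constant
a = u*v*dx             # bilinear form, compiled once
L = t*v*dx             # linear form, also compiled once

uh = Function(V)
dt = 0.1
for step in range(10):
    t.assign((step + 1)*dt)   # updating the Constant does not change the form
    solve(a == L, uh)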
Furthermore, you said you use a system-provided PETSc build. This most likely has nothing to do with your present problem, but it could be a source of further problems, for two reasons:
1) PETSc version. Release versions of PETSc are often too old for Firedrake, while the master branch might be broken. Firedrake maintains a snapshot of PETSc master that is tested and known to work with Firedrake.
2) PETSc configuration. Firedrake assumes that certain features, provided by external packages, are available through PETSc. This may or may not be the case depending on how PETSc was configured. firedrake-install can tell you what switches it passes to PETSc configure.
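For example, recent versions of firedrake-install accept a flag along the following lines (check firedrake-install --help on your copy, since the exact option name may differ between versions), which prints the configure switches without installing anything:

python3 firedrake-install --show-petsc-configure-options

You can compare its output against the configuration of your module-provided PETSc.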
Hello,
The relevant message seems to be this:
A process has executed an operation involving a call to the "fork()" system call to create a child process. Open MPI is currently operating in a condition that could result in memory corruption or other system errors; your job may hang, crash, or produce silent data corruption. The use of fork() (or system() or other calls that create child processes) is strongly discouraged.
Firedrake compiles the high-level specification of your PDE into efficient binary code dynamically, just in time. It first generates C code and then calls the host C compiler to compile it. This step fails because your MPI does not support the fork() system call.
The suggested workaround is to warm up your disk cache by first solving a small problem serially (or on a single node, if that works), and then launching your parallel runs with a warm cache, so that Firedrake does not need to do any further compilation.
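Concretely (script.py here stands for your own script; any cut-down version of the problem will do), run once on a single process, for example

mpirun -np 1 python3 script.py

so that the generated kernels land in the on-disk cache, and only then submit the full multi-node job. For this to help, the cache directory (its location is configurable through PyOP2's configuration) must live on a filesystem visible to all compute nodes.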
Regards,
Miklos
Hello Firedrake team,
I am trying to install Firedrake on a cluster and use MPI to run my Firedrake scripts on multiple nodes. I am running into some issues and was wondering if you would be able to help.
The operating system on the cluster is CentOS 7.4. Both Python 3.5 and PETSc 3.9.0 are already loaded on the cluster via the environment modules tool. So, to install I am running the install script as follows:
python3 firedrake-install --disable-ssh --no-package-manager --honour-petsc-dir --honour-pythonpath
The install seems to go through successfully, and the resulting installation works when running on a single node, but I get errors when I run on multiple nodes. If I run my script as
mpirun -np 8 python3 script.py
and all 8 CPUs are located on the same node, then the script runs successfully. However, if the CPUs are distributed across multiple nodes, then I always get error messages. The general pattern of these error messages is that I first get a PyOP2 caching error that looks like this:
This error will then be followed by a bunch of "During handling of the above exception, another exception occurred" errors. I have attached the full output from one of these error messages.
Do you have any idea what could be going wrong here? Please let me know if you need more information.
Thanks,
Aaron