Hello Firedrake team,
I am trying to install Firedrake on a cluster and use mpi to run my Firedrake scripts on multiple nodes. I am running into some issues and was wondering if you would be able to help.
The operating system on the cluster is CentOS 7.4. Both Python 3.5 and PETSc 3.9.0 are already loaded on the cluster via the environment modules tool. So, to install I am running the install script as follows:
python3 firedrake-install --disable-ssh --no-package-manager --honour-petsc-dir --honour-pythonpath
The install seems to go through successfully. The resultant installation seems to work when running on a single node, but I am getting errors when I run on multiple nodes. If I run my script as
mpirun -np 8 python3 script.py
and all 8 CPUs are located on the same node, then the script seems to run successfully. However, if the CPUs are distributed across multiple nodes, then I always get error messages. The general pattern to these error messages is that I will first get a PyOP2 caching error that looks like this:
This error will then be followed by a bunch of "During handling of the above exception, another exception occurred" errors. I have attached the full output from one of these error messages.
Do you have any idea what could be going wrong here? Please let me know if you need more information.
Thanks,
Aaron