Another reason might be that your mesh (UnitSquareMesh(10, 10)) is simply too small to be parallelised across 32 cores, and some MPI processes are ending up with no mesh cells at all.
On 4 May 2017 at 19:28, Andrew McRae <A.T.T.McRae@bath.ac.uk> wrote:
Hi Paul,
Do the environment variables PYOP2_CACHE_DIR and FIREDRAKE_TSFC_KERNEL_CACHE_DIR point to a filesystem location accessible by all nodes, as described at https://github.com/firedrakeproject/firedrake/wiki/Archer ? They default to somewhere in /tmp, which is often node-local storage.
Not sure why this would cause a segfault, but this is something likely to break things when going from 1 node to multiple.
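A quick way to check this on each node is sketched below. It is only a rough heuristic: the two variable names come from the question above, but the helper name check_cache_dirs and the assumption that anything under /tmp is node-local are illustrative, and your cluster may keep node-local storage elsewhere.

```python
import os

# Variable names as given in the message above.
CACHE_VARS = ("PYOP2_CACHE_DIR", "FIREDRAKE_TSFC_KERNEL_CACHE_DIR")


def check_cache_dirs(environ=os.environ, node_local_prefixes=("/tmp",)):
    """Return {variable: description} for cache variables that look suspect.

    Hypothetical helper: treating paths under /tmp as node-local is an
    assumption, not something Firedrake or PyOP2 checks itself.
    """
    problems = {}
    for var in CACHE_VARS:
        path = environ.get(var)
        if path is None:
            problems[var] = "unset, so the default (somewhere in /tmp) is used"
            continue
        full = os.path.abspath(path)
        for prefix in node_local_prefixes:
            if full == prefix or full.startswith(prefix + os.sep):
                problems[var] = "points at {}, which may be node-local".format(path)
                break
    return problems


if __name__ == "__main__":
    for var, message in sorted(check_cache_dirs().items()):
        print("{}: {}".format(var, message))
```

Running this through the same mpirun command and hostfile as the failing job shows what the remote processes actually see, which can differ from the environment of the login shell.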
On 4 May 2017 at 19:21, Paul Turner <turnerpa@ohsu.edu> wrote:
Hello firedrake,
I’m having problems running a Firedrake test script across multiple nodes. PETSc crashes with segfaults when using more than one node, but the script runs fine when restricted to a single node.
The cluster has 4 nodes where each node has 16 cores/64GB mem with Ubuntu 16.04.2 LTS and IB interconnects. We use SLURM but the errors occur using straight mpirun and a hostfile. I’ve run openmpi and openib tests and they indicate no problems with either subsystem. Firedrake installs cleanly.
pturner@ubuntu-0-0:~/mpitest$ which python
/home/pturner/firedrake/bin/python
print firedrake.__version__
0.13.0+1303.g9070020
Test simple firedrake script using mpirun
pturner@ubuntu-0-0:~/mpitest$ cat firedrake_proj.py
from firedrake import *
mesh = UnitSquareMesh(10, 10)
p1 = FunctionSpace(mesh, 'CG', 1)
f = Function(p1, name='function')
x, y = SpatialCoordinate(mesh)
expr = sin(2*pi*x)*(1 + y)
f.project(expr)
n = norm(f)
if mesh.comm.rank == 0:
    print('Norm {:}'.format(n))
    print('SUCCESS')
Execute with 16 cores (single node):
pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 16 -hostfile ~/hostfile python firedrake_proj.py
Norm 1.07999261002
SUCCESS
Execute with 32 cores (2 nodes):
pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 32 -hostfile ~/hostfile python firedrake_proj.py
Consistent failures on ranks 4 and 20:
[20]PETSC ERROR: ------------------------------------------------------------------------
[20]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[...]
[4]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[...]
Any idea what might be going wrong?
Thx,
--Paul
Paul J Turner
OHSU/CMOP
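On the suggestion at the top of the thread that the mesh is too small for 32 processes, a back-of-the-envelope count may help. The sketch below assumes UnitSquareMesh(nx, ny) produces its default triangular mesh (two triangles per square) and that a CG1 space has one degree of freedom per vertex; the function name average_per_rank is illustrative. It gives averages only and says nothing about how the partitioner actually distributes cells.

```python
def average_per_rank(nx, ny, nranks):
    """Average (cells, vertices) per MPI rank for an nx-by-ny triangulated square.

    Assumes two triangles per square and a perfectly even split, which a real
    mesh partitioner does not guarantee.
    """
    cells = 2 * nx * ny
    vertices = (nx + 1) * (ny + 1)
    return cells / float(nranks), vertices / float(nranks)


if __name__ == "__main__":
    for nranks in (16, 32):
        cells, vertices = average_per_rank(10, 10, nranks)
        print("{} ranks: {:.2f} cells, {:.2f} vertices per rank on average".format(
            nranks, cells, vertices))
```

With 32 ranks the 10 x 10 mesh averages about six cells and fewer than four vertices per process, so an uneven partition could plausibly leave a rank with nothing. Rerunning the script with a larger mesh, for example UnitSquareMesh(100, 100), is a cheap way to test that hypothesis.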