Hi Paul,

Do the environment variables PYOP2_CACHE_DIR and FIREDRAKE_TSFC_KERNEL_CACHE_DIR point to a filesystem location accessible from all nodes, as described at https://github.com/firedrakeproject/firedrake/wiki/Archer? They default to somewhere in /tmp, which is often node-local storage.

I'm not sure why this would cause a segfault, but a node-local cache directory is the kind of thing likely to break when going from one node to several.
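
If it helps, here is a minimal sketch for checking what each node actually sees. It is not part of Firedrake: the helper name check_cache_dirs and the assumption that anything under /tmp is node-local are mine, so adjust the prefixes for your cluster. Running it under the same mpirun command as your script should print one line per rank and variable.

```python
from __future__ import print_function  # keeps the script usable on Python 2 and 3

import os
import socket

# The two variables named above.
CACHE_VARS = ("PYOP2_CACHE_DIR", "FIREDRAKE_TSFC_KERNEL_CACHE_DIR")


def check_cache_dirs(env=None, node_local_prefixes=("/tmp",)):
    """Classify each cache variable as 'unset', 'node-local?' or 'ok?'.

    Hypothetical helper: the prefixes treated as node-local are an
    assumption, not something Firedrake or PyOP2 defines.
    """
    env = os.environ if env is None else env
    report = {}
    for name in CACHE_VARS:
        path = env.get(name)
        if not path:
            # Unset means the default is used, which lives somewhere in /tmp.
            report[name] = "unset"
            continue
        path = os.path.abspath(path)
        under_local = any(
            path == prefix or path.startswith(prefix + "/")
            for prefix in node_local_prefixes
        )
        report[name] = "node-local?" if under_local else "ok?"
    return report


if __name__ == "__main__":
    host = socket.gethostname()
    for name, status in sorted(check_cache_dirs().items()):
        print("{} {}: {}".format(host, name, status))
```

An "ok?" result only means the path is not under /tmp. It does not prove the directory is on a shared filesystem, so that part still needs checking by hand.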

On 4 May 2017 at 19:21, Paul Turner <turnerpa@ohsu.edu> wrote:

Hello firedrake,

I’m having problems running a firedrake test script across multiple nodes. PETSc crashes with segfaults when using more than one node, but the script runs fine when restricted to a single node.

The cluster has 4 nodes, each with 16 cores and 64 GB of memory, running Ubuntu 16.04.2 LTS with IB interconnects. We use SLURM, but the errors also occur using plain mpirun and a hostfile. I’ve run openmpi and openib tests, and they indicate no problems with either subsystem. Firedrake installs cleanly.

pturner@ubuntu-0-0:~/mpitest$ which python
/home/pturner/firedrake/bin/python

>>> print firedrake.__version__
0.13.0+1303.g9070020

Test of a simple firedrake script using mpirun:

pturner@ubuntu-0-0:~/mpitest$ cat firedrake_proj.py
from firedrake import *

mesh = UnitSquareMesh(10, 10)
p1 = FunctionSpace(mesh, 'CG', 1)

f = Function(p1, name='function')

x, y = SpatialCoordinate(mesh)
expr = sin(2*pi*x)*(1 + y)
f.project(expr)

n = norm(f)
if mesh.comm.rank == 0:
    print('Norm {:}'.format(n))
    print('SUCCESS')

Execute with 16 cores (single node):

pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 16 -hostfile ~/hostfile python firedrake_proj.py
Norm 1.07999261002
SUCCESS

Execute with 32 cores (2 nodes):

pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 32 -hostfile ~/hostfile python firedrake_proj.py

Consistent failures on ranks 4 and 20:

[20]PETSC ERROR: ------------------------------------------------------------------------
[20]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[...]
[4]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[...]

Any idea what might be going wrong?

Thx,

--Paul

Paul J Turner
OHSU/CMOP