Re: [firedrake] Problem with petsc seg faults in firedrake
Hi Paul,

Do the environment variables PYOP2_CACHE_DIR and FIREDRAKE_TSFC_KERNEL_CACHE_DIR point to a filesystem location accessible by all nodes, as described at https://github.com/firedrakeproject/firedrake/wiki/Archer ? They default to somewhere in /tmp, which is often node-local storage.

I'm not sure why this would cause a segfault, but it is exactly the kind of thing that breaks when going from one node to several.
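Something along these lines, set on every process before launch, should do it; the paths below are only placeholders for a directory on a shared filesystem (e.g. an NFS-mounted home), and the assumption is that both variables are read when firedrake is first imported, so in a script they have to be set before that import (exporting them in the job script, or forwarding them with Open MPI's -x flag, achieves the same thing):

import os

# Placeholder paths; any directory visible to every node will do.
# Assumption: PyOP2 and the TSFC kernel cache read these variables at
# firedrake import time, so they must be set before the import below.
os.environ["PYOP2_CACHE_DIR"] = "/home/pturner/firedrake-cache/pyop2"
os.environ["FIREDRAKE_TSFC_KERNEL_CACHE_DIR"] = "/home/pturner/firedrake-cache/tsfc"

from firedrake import *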
On 4 May 2017 at 19:21, Paul Turner <turnerpa@ohsu.edu> wrote:

Hello firedrake,
I’m having problems running a firedrake test script across multiple nodes. PETSc crashes with segfaults when using more than one node, but the script runs fine when restricted to a single node.
The cluster has 4 nodes, each with 16 cores and 64 GB of memory, running Ubuntu 16.04.2 LTS with InfiniBand (IB) interconnects. We use SLURM, but the errors also occur using straight mpirun and a hostfile. I’ve run Open MPI and openib tests and they indicate no problems with either subsystem. Firedrake installs cleanly.
pturner@ubuntu-0-0:~/mpitest$ which python
/home/pturner/firedrake/bin/python
print firedrake.__version__
0.13.0+1303.g9070020
Test simple firedrake script using mpirun:
pturner@ubuntu-0-0:~/mpitest$ cat firedrake_proj.py
from firedrake import *
mesh = UnitSquareMesh(10, 10)
p1 = FunctionSpace(mesh, 'CG', 1)
f = Function(p1, name='function')
x, y = SpatialCoordinate(mesh)
expr = sin(2*pi*x)*(1 + y)
f.project(expr)
n = norm(f)
if mesh.comm.rank == 0:
    print('Norm {:}'.format(n))
    print('SUCCESS')
Execute with 16 cores (single node):
pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 16 -hostfile ~/hostfile python firedrake_proj.py
Norm 1.07999261002
SUCCESS
Execute with 32 cores (2 nodes):
pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 32 -hostfile ~/hostfile python firedrake_proj.py
Consistent failures on ranks 4 and 20:
[20]PETSC ERROR: ------------------------------------------------------------------------
[20]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[...]
[4]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[...]
Any idea what might be going wrong?
Thx,
--Paul
Paul J Turner
OHSU/CMOP
Another reason might be that your mesh (UnitSquareMesh(10, 10)) is simply too small to be parallelised across 32 cores, so that some MPI processes are ending up with no mesh at all.
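UnitSquareMesh(10, 10) gives 10 x 10 x 2 = 200 triangles, so on 32 ranks each process owns only about six cells, and the partitioner can easily leave some ranks with none. A quick way to confirm is to print the per-rank cell count using the same failing mpirun line; a sketch (num_cells() reports the cells held by this process and, depending on the Firedrake version, may include halo cells):

from firedrake import *

mesh = UnitSquareMesh(10, 10)
# Each rank reports how many cells it holds; any zeros here would point at
# over-decomposition rather than an MPI or filesystem problem.
print(mesh.comm.rank, mesh.num_cells())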
Thx, I went with UnitSquareMesh(100, 100) and 18 cores. I was not using the caching environment variables as you indicated; after pointing them at a shared filesystem I do get cache items in both directories when running on a single node, but it still crashes when using more than one node. I’ll have to dig deeper. I need a stack trace, no? I’m not sure how to do that, I’m new at this.
--Paul
Paul J Turner
OHSU/CMOP
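On the stack-trace question, the standard-library faulthandler module is a low-effort way to get at least the Python-level traceback from each crashing rank; a minimal sketch of firedrake_proj.py with it enabled (the assumption is that enabling it after the firedrake import lets it take over from the signal handler PETSc installs; for the C-level trace you would still reach for gdb or PETSc's -on_error_attach_debugger option):

import faulthandler
import sys

from firedrake import *

# Dump this rank's Python traceback to stderr if the process receives
# SIGSEGV (or SIGFPE/SIGABRT/SIGBUS). Enabled after importing firedrake
# so it replaces the handler PETSc registers during initialisation; that
# ordering is an assumption worth checking if no traceback appears.
faulthandler.enable(file=sys.stderr, all_threads=True)

mesh = UnitSquareMesh(100, 100)
# ... rest of the script unchanged ...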
participants (2)
- Andrew McRae
- Paul Turner