Re: [firedrake] Problem with petsc seg faults in firedrake
Hi Paul,

Do the environment variables PYOP2_CACHE_DIR and FIREDRAKE_TSFC_KERNEL_CACHE_DIR point to a filesystem location accessible by all nodes, as described at https://github.com/firedrakeproject/firedrake/wiki/Archer ? They default to somewhere in /tmp, which is often node-local storage.

I'm not sure why this would cause a segfault, but it is exactly the kind of thing that breaks when going from one node to several.
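Something along these lines, set on every process before launch, should do it; the paths below are only placeholders for a directory on a shared filesystem (e.g. an NFS-mounted home), and the assumption is that both variables are read when firedrake is first imported, so in a script they have to be set before that import (exporting them in the job script, or forwarding them with Open MPI's -x flag, achieves the same thing):

import os

# Placeholder paths; any directory visible to every node will do.
# Assumption: PyOP2 and the TSFC kernel cache read these variables at
# firedrake import time, so they must be set before the import below.
os.environ["PYOP2_CACHE_DIR"] = "/home/pturner/firedrake-cache/pyop2"
os.environ["FIREDRAKE_TSFC_KERNEL_CACHE_DIR"] = "/home/pturner/firedrake-cache/tsfc"

from firedrake import *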
On 4 May 2017 at 19:21, Paul Turner <turnerpa@ohsu.edu> wrote:

Hello firedrake,
I’m having problems running a firedrake test script across multiple nodes. PETSc crashes with segfaults when using more than one node, but the script runs fine when restricted to a single node.
The cluster has 4 nodes, each with 16 cores and 64 GB of memory, running Ubuntu 16.04.2 LTS with InfiniBand (IB) interconnects. We use SLURM, but the errors also occur using straight mpirun and a hostfile. I’ve run Open MPI and openib tests and they indicate no problems with either subsystem. Firedrake installs cleanly.
pturner@ubuntu-0-0:~/mpitest$ which python
/home/pturner/firedrake/bin/python
print firedrake.__version__
0.13.0+1303.g9070020
Test simple firedrake script using mpirun:
pturner@ubuntu-0-0:~/mpitest$ cat firedrake_proj.py
from firedrake import *
mesh = UnitSquareMesh(10, 10)
p1 = FunctionSpace(mesh, 'CG', 1)
f = Function(p1, name='function')
x, y = SpatialCoordinate(mesh)
expr = sin(2*pi*x)*(1 + y)
f.project(expr)
n = norm(f)
if mesh.comm.rank == 0:
    print('Norm {:}'.format(n))
    print('SUCCESS')
Execute with 16 cores (single node):
pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 16 -hostfile ~/hostfile python firedrake_proj.py
Norm 1.07999261002
SUCCESS
Execute with 32 cores (2 nodes):
pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 32 -hostfile ~/hostfile python firedrake_proj.py
Consistent failures on ranks 4 and 20:
[20]PETSC ERROR: ------------------------------------------------------------------------
[20]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[...]
[4]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[...]
Any idea what might be going wrong?
Thx,
--Paul
Paul J Turner
OHSU/CMOP
Another reason might be that your mesh (UnitSquareMesh(10, 10)) is simply too small to be parallelised across 32 cores, so that some MPI processes are ending up with no mesh at all.
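UnitSquareMesh(10, 10) gives 10 x 10 x 2 = 200 triangles, so on 32 ranks each process owns only about six cells, and the partitioner can easily leave some ranks with none. A quick way to confirm is to print the per-rank cell count using the same failing mpirun line; a sketch (num_cells() reports the cells held by this process and, depending on the Firedrake version, may include halo cells):

from firedrake import *

mesh = UnitSquareMesh(10, 10)
# Each rank reports how many cells it holds; any zeros here would point at
# over-decomposition rather than an MPI or filesystem problem.
print(mesh.comm.rank, mesh.num_cells())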
Thx, I went with UnitSquareMesh(100, 100) and 18 cores. I was not using the caching environment variables as you indicated; after pointing them at a shared filesystem I do get cache items in both directories when running on a single node, but it still crashes when using more than one node. I’ll have to dig deeper. I need a stack trace, no? I’m not sure how to do that, I’m new at this.
--Paul
Paul J Turner
OHSU/CMOP
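On the stack-trace question, the standard-library faulthandler module is a low-effort way to get at least the Python-level traceback from each crashing rank; a minimal sketch of firedrake_proj.py with it enabled (the assumption is that enabling it after the firedrake import lets it take over from the signal handler PETSc installs; for the C-level trace you would still reach for gdb or PETSc's -on_error_attach_debugger option):

import faulthandler
import sys

from firedrake import *

# Dump this rank's Python traceback to stderr if the process receives
# SIGSEGV (or SIGFPE/SIGABRT/SIGBUS). Enabled after importing firedrake
# so it replaces the handler PETSc registers during initialisation; that
# ordering is an assumption worth checking if no traceback appears.
faulthandler.enable(file=sys.stderr, all_threads=True)

mesh = UnitSquareMesh(100, 100)
# ... rest of the script unchanged ...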
participants (2)
- Andrew McRae
- Paul Turner