Problem with PETSc segfaults in Firedrake
Hello firedrake,

I'm having problems running a Firedrake test script across multiple nodes. PETSc crashes with segfaults when the job uses more than one node, but the script runs fine when restricted to a single node.

The cluster has 4 nodes, each with 16 cores and 64 GB of memory, running Ubuntu 16.04.2 LTS with InfiniBand (IB) interconnects. We use SLURM, but the errors also occur with plain mpirun and a hostfile. I've run Open MPI and openib tests, and they indicate no problems with either subsystem. Firedrake installs cleanly.

    pturner@ubuntu-0-0:~/mpitest$ which python
    /home/pturner/firedrake/bin/python

    print firedrake.__version__
    0.13.0+1303.g9070020
Test of a simple Firedrake script using mpirun:

    pturner@ubuntu-0-0:~/mpitest$ cat firedrake_proj.py
    from firedrake import *

    mesh = UnitSquareMesh(10, 10)
    p1 = FunctionSpace(mesh, 'CG', 1)
    f = Function(p1, name='function')
    x, y = SpatialCoordinate(mesh)
    expr = sin(2*pi*x)*(1 + y)
    f.project(expr)
    n = norm(f)
    if mesh.comm.rank == 0:
        print('Norm {:}'.format(n))
        print('SUCCESS')

Execute with 16 cores (single node):

    pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 16 -hostfile ~/hostfile python firedrake_proj.py
    Norm 1.07999261002
    SUCCESS

Execute with 32 cores (2 nodes):

    pturner@ubuntu-0-0:~/mpitest$ mpirun --mca btl openib,sm,self --mca mpi_warn_on_fork 0 -n 32 -hostfile ~/hostfile python firedrake_proj.py

Consistent failures on ranks 4 and 20:

    [20]PETSC ERROR: ------------------------------------------------------------------------
    [20]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
    [...]
    [4]PETSC ERROR: ------------------------------------------------------------------------
    [4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
    [...]

Any idea what might be going wrong?

Thx,
--Paul

Paul J Turner
OHSU/CMOP
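One pattern in the failing ranks may be worth checking. Assuming the hostfile gives each node 16 slots and Open MPI fills each host's slots in order before moving to the next (its default by-slot mapping), global rank r runs on node r // 16 as local rank r % 16. Under that assumption, ranks 4 and 20 are the same local rank (4) on nodes 0 and 1. The short Python sketch below pulls the SEGV ranks out of the PETSc output and applies that mapping; the helper names and the 16-ranks-per-node figure are illustrative assumptions, not something the log itself reports. It uses only the standard library and avoids f-strings so it runs under the Python 2 interpreter shown above as well as Python 3.

```python
import re

# Matches PETSc lines such as
# "[20]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, ..."
SEGV_LINE = re.compile(r"^\[(\d+)\]PETSC ERROR: Caught signal number 11 SEGV")


def segv_ranks(log_text):
    """Return the sorted list of MPI ranks that reported a SEGV."""
    ranks = set()
    for line in log_text.splitlines():
        m = SEGV_LINE.match(line.strip())
        if m:
            ranks.add(int(m.group(1)))
    return sorted(ranks)


def node_and_local_rank(rank, ranks_per_node=16):
    """Map a global rank to (node index, local rank).

    ASSUMPTION: by-slot placement with 16 slots per host in the hostfile;
    other mappings (e.g. by-node round-robin) would give different results.
    """
    return divmod(rank, ranks_per_node)


# Excerpt of the failing 32-rank run, copied from the output above.
log = """\
[20]PETSC ERROR: ------------------------------------------------------------------------
[20]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[4]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
"""

for r in segv_ranks(log):
    node, local = node_and_local_rank(r)
    print("rank {}: node {}, local rank {}".format(r, node, local))
```

If the assumption holds, both crashes land on local rank 4 of their node. Adding `--display-map` to the mpirun command line would show Open MPI's actual rank placement and confirm or rule this out.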