Firedrake not strong-scaling as well as a plain PETSc program?
Hi all,

I am not sure if this is a Firedrake issue/question per se, but I noticed that when I run PETSc's SNES ex12 (3D FEM Poisson with DMPlex) with 531,441 dofs, I get the following KSPSolve(...) times on up to 8 cores:

1: 4.4628 s
2: 2.0991 s
4: 1.0930 s
8: 0.6591 s

That's roughly 85% parallel efficiency. Now, when I run a similar Firedrake 3D version:

1: 7.1377 s
2: 3.6969 s
4: 2.0406 s
8: 1.2939 s

That's now almost 69% parallel efficiency. I used the same solver and preconditioner (GMRES with ML). The PETSC_DIR I am using for SNES ex12 is /path/to/firedrake/lib/python2.7/site-packages/petsc, so I am basically using the same PETSc and compilers for both cases. I am using OpenMPI-1.6.5 with the binding options "--bind-to-core --bysocket" on an Intel Xeon E5-2670 node.

The efficiency for Firedrake gets worse when I scale the problem up to 64 cores (8 cores on each of 8 nodes). It drops to as low as 40%, while the PETSc case still maintains roughly 75%.

This is rather strange. Shouldn't I expect roughly the same scaling performance for both implementations? Or is this normal? Note that I disregarded the assembly of the Jacobian and the residual (Function) in both cases, because Firedrake is much faster there than Matt's PetscFE.

I understand that the more "serially efficient" a code is, the less parallel efficient it may be. Might this scaling issue have to do with Python and/or the implementation of Firedrake? FWIW, I installed my own Python because the HPC machine does not have a compatible Python 2.7.

Thanks,
Justin
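For reference, the parallel efficiencies quoted above follow from E(p) = T(1) / (p * T(p)); a minimal sketch that reproduces them from the KSPSolve timings in the post:

    # Parallel efficiency E(p) = T(1) / (p * T(p)), using the timings quoted above.
    petsc_times = {1: 4.4628, 2: 2.0991, 4: 1.0930, 8: 0.6591}
    firedrake_times = {1: 7.1377, 2: 3.6969, 4: 2.0406, 8: 1.2939}

    def efficiency(times):
        t1 = times[1]
        return {p: t1 / (p * tp) for p, tp in sorted(times.items())}

    print(efficiency(petsc_times))      # 8 cores: ~0.85 (85%)
    print(efficiency(firedrake_times))  # 8 cores: ~0.69 (69%)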
Can you send -log_view for both cases?

Lawrence
Attached are the log_views for all 8 cases (4 PETSc and 4 Firedrake). The KSPSolve timings for the PETSc runs are slightly different this time because I ensured that the solver tolerances and parameters matched those of the Firedrake case. Still, I see roughly 81% efficiency compared to 69%. I also used an extruded mesh for Firedrake (with 2D triangles as the base), whereas PETSc used DMPlexCreateBoxMesh(...).

Thanks,
Justin
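A minimal sketch of what a comparable extruded-mesh Poisson setup in Firedrake might look like. The actual script is not in the thread, so the resolution, boundary conditions, and tolerance below are assumptions; 80 cells per direction with P1 elements gives 81^3 = 531,441 dofs, matching the count quoted for the PETSc run:

    from firedrake import *

    base = UnitSquareMesh(80, 80)          # 2D triangular base mesh (resolution assumed)
    mesh = ExtrudedMesh(base, layers=80)   # triangular prisms; P1 gives 81^3 = 531,441 dofs

    V = FunctionSpace(mesh, "CG", 1)
    u, v = TrialFunction(V), TestFunction(V)
    f = Constant(1.0)                      # source term assumed

    a = inner(grad(u), grad(v)) * dx
    L = f * v * dx

    # On an extruded mesh the top/bottom facets are separate from the side markers.
    bcs = [DirichletBC(V, 0, (1, 2, 3, 4)),
           DirichletBC(V, 0, "top"),
           DirichletBC(V, 0, "bottom")]

    uh = Function(V)
    solve(a == L, uh, bcs=bcs,
          solver_parameters={"snes_type": "ksponly",
                             "ksp_type": "gmres",
                             "pc_type": "ml",
                             "ksp_rtol": 1e-7})   # tolerance assumed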
Hi Justin,

I realise I should have asked for -snes_view as well.

I notice the problems are not quite set up the same, since the PETSc version uses -snes_type newtonls, whereas the Firedrake one uses ksponly.

I guess there are a number of things going on. The connectivity is different between triangular prisms and tetrahedra, for one. You could compare closer things if you used a UnitCubeMesh in Firedrake. However, the dof ordering will still not be the same, and this may have an effect on the resulting AMG hierarchy (and hence the scalability). As you note, basically all the scaling issues come from application of the ML preconditioner. That may need a bit of tuning. Its application will be bandwidth limited, so the dof ordering and how fast things coarsen will likely have a big effect.

Cheers,
Lawrence
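A sketch of the closer comparison and ML experimentation suggested here; the mesh resolution and option values are illustrative guesses, not settings taken from the thread:

    from firedrake import *

    # Tetrahedra rather than extruded prisms, closer to the DMPlex box mesh.
    mesh = UnitCubeMesh(80, 80, 80)       # resolution assumed

    V = FunctionSpace(mesh, "CG", 1)
    u, v = TrialFunction(V), TestFunction(V)
    a = inner(grad(u), grad(v)) * dx
    L = Constant(1.0) * v * dx
    bcs = DirichletBC(V, 0, "on_boundary")

    params = {
        "snes_type": "ksponly",
        "ksp_type": "gmres",
        "pc_type": "ml",
        # ML knobs one might experiment with (values illustrative, not tuned):
        "pc_ml_maxNlevels": 10,
        "pc_ml_Threshold": 0.02,
        # PCML builds a PCMG underneath, so the usual smoother options also apply:
        "mg_levels_ksp_type": "chebyshev",
        "mg_levels_pc_type": "sor",
    }

    uh = Function(V)
    solve(a == L, uh, bcs=bcs, solver_parameters=params)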
Lawrence,

So yes, I am comparing apples to oranges here. The reason I wanted extruded meshes was that mesh generation with Tetgen is extremely slow. I guess I'll just have to play around with these implementations a little more.

Thanks,
Justin
participants (2):
- Justin Chang
- Lawrence Mitchell