DMPlex in Firedrake: scaling of mesh distribution
To PETSc DMPlex users, Firedrake users, Dr. Knepley and Dr. Karpeev:

Is it expected for the mesh distribution step to (A) take 50-99% of the total time-to-solution of an FEM problem, (B) take an amount of time that increases with the number of ranks, and (C) take an amount of memory on rank 0 that does not decrease with the number of ranks?

The attached plots suggest that (A), (B), and (C) are all happening for a Cahn-Hilliard problem (from the firedrake-bench repo) on a 2D 8K x 8K unit-square mesh. The implementation is here [1]. Versions: Firedrake and PyOP2 20200204.0; PETSc 3.13.1; ParMETIS 4.0.3.

Two questions, one on (A) and the other on (B)+(C):

1. Is result (A) expected? Given (A), any effort to improve the quality of the compiled assembly kernels (or anything else other than mesh distribution) appears futile, since it accounts for only 1% of end-to-end execution time -- or am I missing something?

1a. Is mesh distribution fundamentally necessary for any FEM framework, or is it only needed by Firedrake? If the latter, how do other frameworks partition the mesh and execute in parallel with MPI while avoiding the non-scalable mesh distribution step?

2. Results (B) and (C) suggest that the mesh distribution step does not scale. Is a central bottleneck in the master process a fundamental property of the mesh distribution problem, or is it a limitation of the current implementation in PETSc-DMPlex?

2a. Our result (B) seems to agree with Figure 4 (left) of [2]. Figure 6 of [2] suggests a way to reduce the time spent in the sequential bottleneck by "parallel mesh refinement", which creates high-resolution meshes from an initial coarse mesh. Is this approach implemented in DMPlex? If so, any pointers on how to try it out with Firedrake? If not, any other directions for reducing this bottleneck?

2b. Figure 6 in [3] shows plots for the Assembly and Solve steps that scale well up to 96 cores -- is mesh distribution included in those times? Is anyone reading this aware of other publications evaluating Firedrake that measure mesh distribution (or explain how to avoid or exclude it)?

Thank you for your time and any info or tips.

[1] https://github.com/ISI-apex/firedrake-bench/blob/master/cahn_hilliard/firedr...
[2] Unstructured Overlapping Mesh Distribution in Parallel, Matthew G. Knepley, Michael Lange, Gerard J. Gorman, 2015. https://arxiv.org/pdf/1506.06194.pdf
[3] Efficient mesh management in Firedrake using PETSc-DMPlex, Michael Lange, Lawrence Mitchell, Matthew G. Knepley, Gerard J. Gorman, SISC, 38(5), S143-S155, 2016. http://arxiv.org/abs/1506.07749
Alexei,

Sorry to hear about your difficulties with the mesh distribution step. Based on the figures you included, the extreme difficulties occur on the Oak Ridge Summit system? On the Argonne Theta system, though the distribution time goes up, it does not dominate the computation?

There is a very short discussion of mesh distribution in https://arxiv.org/abs/2102.13018, section 6.3, Figure 11, which was run on the Summit system.

Certainly there is no intention that mesh distribution dominate the entire computation; your particular case and the behavior of DMPlex on Summit would need to be understood to determine the problems. DMPlex and all its related components are rapidly evolving, so performance can change quickly with new updates. I urge you to use the main branch of PETSc for HPC and timing studies of DMPlex performance on large systems; do not just use PETSc releases. You can communicate directly with those working on scaling DMPlex at gitlab.com/petsc/petsc, who can help understand the cause of the performance issues on Summit.

Barry
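As a side note on measuring this: below is a minimal sketch, assuming a Firedrake driver script for the benchmark, of how the mesh setup/distribution cost could be separated from assembly and solve using PETSc log stages. The stage names and the 8192 x 8192 mesh size are illustrative, not taken from the benchmark code.

from firedrake import UnitSquareMesh, FunctionSpace
from firedrake.petsc import PETSc

# Put mesh construction (serial build plus one-to-all distribution) in its
# own log stage so that -log_view reports it separately.
mesh_stage = PETSc.Log.Stage("MeshSetup")
mesh_stage.push()
mesh = UnitSquareMesh(8192, 8192)
mesh.init()  # force the lazily-built mesh to be constructed inside the stage
mesh_stage.pop()

# Everything downstream (function spaces, assembly, solve) in another stage.
solve_stage = PETSc.Log.Stage("AssemblyAndSolve")
solve_stage.push()
V = FunctionSpace(mesh, "CG", 1)
# ... set up and solve the Cahn-Hilliard problem here ...
solve_stage.pop()

Running with something like "mpiexec -n 96 python bench.py -log_view" should then show per-stage timings; the DMPlex-level events in the same output (e.g. DMPlexDistribute, Mesh Partition, Mesh Migration) usually give a finer breakdown of where the distribution time goes.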
Dear Alexei,

I echo the comments that Barry and others have made. Some more inline below.
On 5 Mar 2021, at 21:06, Alexei Colin <acolin@isi.edu> wrote:
> To PETSc DMPlex users, Firedrake users, Dr. Knepley and Dr. Karpeev:
> Is it expected for the mesh distribution step to (A) take 50-99% of the total time-to-solution of an FEM problem, and
We hope not!
> (B) take an amount of time that increases with the number of ranks, and (C) take an amount of memory on rank 0 that does not decrease with the number of ranks?
This is a consequence, as Matt notes, of us making a serial mesh and then doing a one-to-all distribution.
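For what it's worth, a minimal petsc4py sketch of that pattern (the box-mesh size, overlap, and the printed check are illustrative, and this is not the exact code path Firedrake takes): every rank enters the calls collectively, but the cells initially live on rank 0, and a single distribute call hands each rank its share.

from petsc4py import PETSc

# Build a small box mesh; before distribution all of its cells sit on rank 0.
dm = PETSc.DMPlex().createBoxMesh([64, 64], simplex=True, comm=PETSc.COMM_WORLD)

# One-to-all distribution: the serial mesh is partitioned and each rank
# receives its piece (plus a one-cell overlap here). Rank 0 has to hold the
# entire mesh up to this point, which is why its memory use does not shrink
# as ranks are added.
dm.distribute(overlap=1)

# After distribution each rank owns only a local patch of cells.
c_start, c_end = dm.getHeightStratum(0)
PETSc.Sys.syncPrint(f"rank {PETSc.COMM_WORLD.getRank()}: {c_end - c_start} cells")
PETSc.Sys.syncFlush()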
> 1a. Is mesh distribution fundamentally necessary for any FEM framework, or is it only needed by Firedrake? If the latter, how do other frameworks partition the mesh and execute in parallel with MPI while avoiding the non-scalable mesh distribution step?
Matt points out that we should do something smarter (namely, make and distribute a small mesh from serial to parallel, and then do refinement and repartitioning in parallel). This is not implemented out of the box, but here is some code that (in up-to-date Firedrake/PETSc) does that:

from firedrake import *
from firedrake.cython.dmcommon import CELL_SETS_LABEL, FACE_SETS_LABEL
from firedrake.cython.mgimpl import filter_labels
from firedrake.petsc import PETSc

# Create a small mesh that is cheap to distribute.
mesh = UnitSquareMesh(10, 10)
dm = mesh.topology_dm
dm.setRefinementUniform(True)

# Refine it a bunch of times, edge midpoint division.
rdm = dm.refine()
rdm = rdm.refine()
rdm = rdm.refine()

# Remove some labels that will be reconstructed.
filter_labels(rdm, rdm.getHeightStratum(1), "exterior_facets",
              "boundary_faces", FACE_SETS_LABEL)
filter_labels(rdm, rdm.getHeightStratum(0), CELL_SETS_LABEL)
for label in ["interior_facets", "pyop2_core", "pyop2_owned", "pyop2_ghost"]:
    rdm.removeLabel(label)

# Redistribute for better load-balanced partitions (this happens in parallel).
rdm.distribute()

# Now make the Firedrake mesh object.
rmesh = Mesh(rdm, distribution_parameters={"partition": False})

# Now do things in parallel.

This is probably something we should push into the library (it's quite fiddly!), so if you can try it out easily and check that it works, please let us know!

Thanks,

Lawrence
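If it is useful, a possible check to append at the end of the script above (using the names from that script; the expected count assumes the 10 x 10 base mesh and three refinements, i.e. 200 triangles times 4^3 = 12800 cells):

# Appended to the end of the script above; rmesh comes from that script.
V = FunctionSpace(rmesh, "CG", 1)  # function spaces should build as usual
print(f"rank {rmesh.comm.rank}: {rmesh.cell_set.size} owned cells of 12800 total")

Run under MPI, e.g. "mpiexec -n 4 python refine_then_distribute.py" (the file name here is made up), the owned-cell counts should come out roughly balanced across the ranks.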
Participants (3):

- Alexei Colin
- Barry Smith
- Lawrence Mitchell