Is phase 1 the old method and phase 2 the new one? Is this a 128^3 mesh per process?

On Sun, Mar 7, 2021 at 7:27 AM Stefano Zampini <stefano.zampini@gmail.com> wrote:
[2] On the robustness and performance of entropy stable discontinuous collocation methods for the compressible Navier-Stokes equations, Rojas et al. https://arxiv.org/abs/1911.10966
This is not the proper reference; the correct one is https://www.sciencedirect.com/science/article/pii/S0021999120306185?dgcid=rs... However, there the algorithm is only outlined, and performance related to the mesh distribution is not really reported. We observed a large gain at large core counts for one-to-all distributions (from minutes to seconds) by splitting the several communication rounds needed by DMPlex into stages: first from rank 0 to one rank per node, and then decomposing independently within each node. Attached is the total time of one-to-all DMPlexDistribute for a 128^3 mesh.
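For reference, a minimal petsc4py sketch of the kind of one-to-all distribution being timed here; the 128^3 box mesh, the barrier-based timing, and the run command are illustrative assumptions, not the actual benchmark script:

    # Time a one-to-all DMPlexDistribute for a 128^3 box mesh.
    # Run with e.g.: mpiexec -n <ranks> python time_distribute.py
    from mpi4py import MPI
    from petsc4py import PETSc

    comm = MPI.COMM_WORLD

    # By default the box mesh is generated serially (on rank 0) and the other
    # ranks start empty, so distribute() below is a one-to-all step.
    dm = PETSc.DMPlex().createBoxMesh([128, 128, 128], simplex=False,
                                      comm=PETSc.COMM_WORLD)

    comm.Barrier()
    t0 = MPI.Wtime()
    sf = dm.distribute(overlap=0)   # the one-to-all DMPlexDistribute call
    comm.Barrier()
    t1 = MPI.Wtime()

    if comm.rank == 0:
        print("DMPlexDistribute: %.2f s on %d ranks" % (t1 - t0, comm.size))

    # The staged approach described above would replace this single call with a
    # rank-0-to-node-leaders round followed by independent intra-node rounds.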
The attached plots suggest (A), (B), and (C) are happening for a Cahn-Hilliard problem (from the firedrake-bench repo) on a 2D 8K x 8K unit-square mesh. The implementation is here [1]. Versions: Firedrake and PyOP2 20200204.0; PETSc 3.13.1; ParMETIS 4.0.3.
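For concreteness, a minimal sketch of how the mesh-setup phase can be timed separately from assembly in Firedrake; this is not the benchmark code (which is in [1]), and the Poisson form below is only a stand-in for the Cahn-Hilliard assembly:

    # Separate the mesh construction/distribution time from assembly time.
    # Run with e.g.: mpiexec -n <ranks> python time_mesh_setup.py
    from mpi4py import MPI
    from firedrake import (UnitSquareMesh, FunctionSpace, TrialFunction,
                           TestFunction, inner, grad, dx, assemble)

    comm = MPI.COMM_WORLD

    comm.Barrier()
    t0 = MPI.Wtime()
    mesh = UnitSquareMesh(8192, 8192)
    mesh.init()   # force the otherwise lazy mesh construction and distribution
    comm.Barrier()
    t1 = MPI.Wtime()

    V = FunctionSpace(mesh, "CG", 1)
    u, v = TrialFunction(V), TestFunction(V)
    A = assemble(inner(grad(u), grad(v)) * dx)   # representative assembly step
    comm.Barrier()
    t2 = MPI.Wtime()

    if comm.rank == 0:
        print("mesh setup: %.1f s, assembly: %.1f s on %d ranks"
              % (t1 - t0, t2 - t1, comm.size))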
Two questions, one on (A) and the other on (B)+(C):
1. Is result (A) expected? Given (A), any effort to improve the quality of the compiled assembly kernels (or anything else other than mesh distribution) appears futile, since assembly takes only 1% of the end-to-end execution time. Or am I missing something?
1a. Is mesh distribution fundamentally necessary for any FEM framework, or is it only needed by Firedrake? If the latter, how do other frameworks partition the mesh and execute in parallel with MPI while avoiding the non-scalable mesh distribution step?
2. Results (B) and (C) suggest that the mesh distribution step does not scale. Is it a fundamental property of the mesh distribution problem that it has a central bottleneck in the master process, or is it a limitation of the current implementation in PETSc-DMPlex?
2a. Our (B) result seems to agree with Figure 4 (left) of [2]. Figure 6 of [2] suggests a way to reduce the time spent on the sequential bottleneck by "parallel mesh refinement", which creates high-resolution meshes from an initial coarse mesh. Is this approach implemented in DMPlex? If so, any pointers on how to try it out with Firedrake? (A possible starting point is sketched after this list.) If not, any other directions for reducing this bottleneck?
2b. Fig 6 in [3] shows plots for Assembly and Solve steps that scale well up to 96 cores -- is mesh distribution included in those times? Is anyone reading this aware of any other publications with evaluations of Firedrake that measure mesh distribution (or explain how to avoid or exclude it)?
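To make 2a concrete, the approach we have in mind would look roughly like the Firedrake sketch below; the coarse resolution and the number of refinement levels are illustrative guesses chosen to reach roughly 8K x 8K, and we have not verified that this avoids the serial distribution cost:

    # Distribute a cheap coarse mesh, then refine in parallel so the fine mesh
    # never has to be built and partitioned on a single rank.
    from firedrake import UnitSquareMesh, MeshHierarchy, FunctionSpace

    coarse = UnitSquareMesh(256, 256)      # small enough to distribute quickly
    hierarchy = MeshHierarchy(coarse, 5)   # 5 levels of uniform refinement (2x each)
    fine = hierarchy[-1]                   # roughly 8192 x 8192 resolution

    V = FunctionSpace(fine, "CG", 1)       # build spaces on the refined mesh as usual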
Thank you for your time and any info or tips.
[1] https://github.com/ISI-apex/firedrake-bench/blob/master/cahn_hilliard/firedr...
[2] Unstructured Overlapping Mesh Distribution in Parallel, Matthew G. Knepley, Michael Lange, Gerard J. Gorman, 2015. https://arxiv.org/pdf/1506.06194.pdf
[3] Efficient mesh management in Firedrake using PETSc-DMPlex, Michael Lange, Lawrence Mitchell, Matthew G. Knepley and Gerard J. Gorman, SISC, 38(5), S143-S155, 2016. http://arxiv.org/abs/1506.07749
-- Stefano