Hello,

I'm trying to reproduce the Firedrake strong scaling experiments reported in this paper <https://arxiv.org/pdf/1501.01809.pdf>. However, I'm unable to reproduce similar scaling trends in my experiments. Here are some more details:

*Machine:* I'm on TACC's Stampede2, running experiments on the Skylake nodes. Each node has 2 sockets, with 24 physical cores per socket.

*PETSc:* The Firedrake fork of PETSc was compiled with the 2017 Intel compilers and the -O3 and Skylake-specific -xCORE-AVX512 flags.

*Problem:* The problem is the Poisson equation in 3D. The preconditioner and linear solver are the same as in the paper. At each core count, each problem was run twice and only the second run was timed.

*Results:* I'm attaching a strong scaling plot. This plot was created using 80x80x80 cubes in the mesh, and 3rd-order continuous polynomials as the finite element function space. This results in ~14M DOFs in total. The plot shows the maximum time spent solving the pre-assembled linear system, as measured by wrapping MPI timers around firedrake.solve(A, u, solver_parameters=param). Also reported is the time spent in the KSPSolve event. While the KSPSolve event scales well, the call to firedrake.solve doesn't. It seems like an operation that occurs after firedrake.solve is called but before ksp.solve is causing the bottleneck.

I'm also attaching the Python script I used to solve the Poisson equation, and a sample output from PETSc's -log_view option. Any hints as to why this isn't scaling would be appreciated!
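For context, a minimal sketch of the kind of timing harness described above. This is not the attached script: the variational form, boundary condition, and solver options (CG with hypre here) are assumptions for illustration only.

    from firedrake import *
    from mpi4py import MPI

    mesh = UnitCubeMesh(80, 80, 80)
    V = FunctionSpace(mesh, "CG", 3)
    u = TrialFunction(V)
    v = TestFunction(V)
    a = inner(grad(u), grad(v)) * dx
    L = Constant(1.0) * v * dx
    bc = DirichletBC(V, 0.0, "on_boundary")

    uh = Function(V)
    A = assemble(a, bcs=bc)   # pre-assemble the linear system
    b = assemble(L)
    bc.apply(b)               # the reply below notes this is unnecessary when bcs are attached to A

    param = {"ksp_type": "cg", "pc_type": "hypre"}  # assumed solver options

    comm = MPI.COMM_WORLD
    comm.Barrier()
    t0 = MPI.Wtime()
    solve(A, uh, b, solver_parameters=param)
    elapsed = MPI.Wtime() - t0

    # The plot reports the maximum over ranks.
    t_max = comm.allreduce(elapsed, op=MPI.MAX)
    if comm.rank == 0:
        print("solve time (max over ranks):", t_max)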
Dear Siddhant,

comments inline below.

On 22/05/18 05:05, Siddhant Wahal wrote:
Hello,
I'm trying to reproduce the Firedrake strong scaling experiments reported in this paper <https://arxiv.org/pdf/1501.01809.pdf>. However, I'm unable to reproduce similar scaling trends in my experiments.
Some of the details here are a little subtle, but I'll try my best to explain.
*Problem:* The problem is the Poisson equation in 3D. The preconditioner and linear solver are the same as in the paper. At each core count, each problem was run twice and only the second run was timed.
If I look in the code, I see that you run a solve once. So presumably you mean you do:

    mpiexec -n ... python poisson.py
    mpiexec -n ... python poisson.py

and only measure timings from the second run?
*Results:*
I'm attaching a strong scaling plot. This plot was created using 80x80x80 cubes in the mesh, and 3rd order continuous polynomials as the finite element function space. This results in ~14M DOFs in total.
The plot shows the maximum time spent solving the pre-assembled linear system, as measured by wrapping MPI timers around firedrake.solve(A, u, solver_parameters=param). Also reported is the time spent in the KSPSolve event. While the KSPSolve event scales well, the call to firedrake.solve doesn't. It seems like an operation that occurs after firedrake.solve is called but before ksp.solve is causing the bottleneck.
So, the first thing to note here is that (for various implementation reasons) calls to `assemble` do not necessarily do the compute at that moment. Rather, they return an object with a promise that "when you need to use the data" the values will be computed.

So, in your code, you do (in pseudo-code):

    with mat_timer:
        A = assemble(a, bcs=bc)

    with rhs_timer:
        b = assemble(L)
        # Incidentally, you don't need to apply the boundary conditions
        # to the rhs here: if a matrix has bcs attached, the solve will
        # take care to apply them correctly.
        bc.apply(b)

    with solve_timer:
        solve(A, u, b)

The first two steps complete almost immediately (probably independent of the problem size), because they have just queued up some computation. Now, when you come to solve the system, we know that you actually want the matrix and right-hand side to be computed, so before handing them to the KSP, we actually perform the computation. That explains why the "solve" time is so much larger than the "KSPSolve" time.

It doesn't yet explain why you don't get the same kind of strong scaling performance observed in the paper. To explain this, we need to understand a little of what goes on when you call assemble. There are two compilation stages from the high-level UFL form to the code that executes the assembly:

1. compile the UFL to an element kernel;
2. compile code that executes this element kernel over the mesh.

The latter potentially invokes the C compiler. To avoid calling the compiler too often, we remember the compiled code on disk (so that the next time we see it, we don't compile again). Moreover, when you assemble the same form again in the same run, we have already dynamically loaded the relevant code, so we don't need to touch the disk again.

It is quite possible that what you're seeing in your timings is the cost of touching the disk to find the code we need to execute. Especially on a supercomputer with a shared filesystem, on many processes this can be somewhat slower than ideal.

In the Firedrake paper, we measure timings without these extra costs. The rationale is that although you might be reporting scaling for a single Poisson solve, in all likelihood you will need to do many such solves in your actual simulation. Hence, ignoring the amortized "setup" time is a reasonable thing to do.

For your code we can do this like so. Add the additional parameter:

    parameters["pyop2_options"]["lazy_evaluation"] = False

To time the matrix and rhs assembly:

    # Ensure a warm code cache.
    A = assemble(a, bcs=bc)
    A.force_evaluation()
    b = assemble(L)

    with matrix_timer:
        # If you want to reassemble into the same matrix, write
        # A = assemble(a, bcs=bc, tensor=A) instead of this line.
        A = assemble(a, bcs=bc)
        A.force_evaluation()

    with rhs_timer:
        b = assemble(L, tensor=b)

    with solve_timer:
        solve(A, u, b)

With this setup, you don't need to execute the script twice, but only once: the "in memory" warmup does the job.

Hope that clarifies things. If you do this, I imagine that the scaling plots will make more sense (although please do let us know if you have other issues). Additionally, I recommend you also produce scaling plots for the matrix and rhs assembly (to convince yourself that they are doing the right thing).

Cheers,

Lawrence
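Putting these suggestions together, a self-contained timing script might look roughly like the sketch below. As in the earlier sketch, the form, boundary condition, and solver options are assumptions for illustration, and MPI.Wtime with a max-reduction stands in for whatever timers the original script used.

    from firedrake import *
    from mpi4py import MPI

    # Disable lazy evaluation so assemble() does its work immediately.
    parameters["pyop2_options"]["lazy_evaluation"] = False

    mesh = UnitCubeMesh(80, 80, 80)
    V = FunctionSpace(mesh, "CG", 3)
    u = TrialFunction(V)
    v = TestFunction(V)
    a = inner(grad(u), grad(v)) * dx
    L = Constant(1.0) * v * dx
    bc = DirichletBC(V, 0.0, "on_boundary")
    uh = Function(V)
    param = {"ksp_type": "cg", "pc_type": "hypre"}  # assumed solver options

    # Warm-up: compile and load the kernels, populating the in-memory cache.
    A = assemble(a, bcs=bc)
    A.force_evaluation()
    b = assemble(L)

    comm = MPI.COMM_WORLD
    comm.Barrier()

    t0 = MPI.Wtime()
    A = assemble(a, bcs=bc, tensor=A)  # reassemble into the same matrix
    A.force_evaluation()
    matrix_time = MPI.Wtime() - t0

    t0 = MPI.Wtime()
    b = assemble(L, tensor=b)
    rhs_time = MPI.Wtime() - t0

    t0 = MPI.Wtime()
    solve(A, uh, b, solver_parameters=param)
    solve_time = MPI.Wtime() - t0

    for name, t in [("matrix", matrix_time), ("rhs", rhs_time), ("solve", solve_time)]:
        t_max = comm.allreduce(t, op=MPI.MAX)
        if comm.rank == 0:
            print(name, "time (max over ranks):", t_max)

With lazy evaluation disabled and the caches warmed in memory, the matrix, rhs, and solve timings should each reflect only the compute they name.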
Thanks Lawrence!

You were right, I was indeed doing:

    mpiexec -n ... python poisson.py
    mpiexec -n ... python poisson.py

and then timing only the second run. Forcing the assembly of the matrices leads to much better scaling of the solve phase.

Thanks for your help!
participants (2):
- Lawrence Mitchell
- Siddhant Wahal