Thanks Lawrence!

You were right, I was indeed doing:

mpiexec -n ... python poisson.py
mpiexec -n ... python poisson.py

and then timing only the second run. 

And forcing the assembly of the matrices leads to much better scaling of the solve phase.

Thanks for your help!



On Tue, May 22, 2018 at 4:14 AM Lawrence Mitchell <lawrence.mitchell@imperial.ac.uk> wrote:
Dear Siddhant,

comments inline below.

On 22/05/18 05:05, Siddhant Wahal wrote:
> Hello, 
>
> I'm trying to reproduce the Firedrake strong scaling experiments, as
> reported in this paper <https://arxiv.org/pdf/1501.01809.pdf>.
> However, I'm unable to obtain similar scaling trends in my experiments.

Some of the details here are a little subtle, but I'll try my best to
explain.


> The problem is the Poisson equation in 3D. The preconditioner and
> linear solver are again the same as in the paper. At each core count,
> each problem was run twice and only the second run was timed.

If I look at the code, I see that you only run a single solve.  So
presumably you mean you do:

mpiexec -n ... python poisson.py
mpiexec -n ... python poisson.py

and only measure timings from the second run?

> *Results:*
>
> I'm attaching a strong scaling plot. This plot was created using
> 80x80x80 cubes in the mesh, and 3rd order continuous polynomials as
> the finite element function space. This results in ~14M DOFs in total.

> The plot shows the maximum time spent solving the pre-assembled
> linear system, as measured by wrapping MPI timers around
> firedrake.solve(A, u, solver_parameters=param). Also reported is the
> time spent in the KSPSolve event. While the KSPSolve event scales
> well, the call to firedrake.solve doesn't. It seems like some
> operation that occurs after entering firedrake.solve but before
> ksp.solve is causing the bottleneck.

So, the first thing to note here is that (for various implementation
reasons) calls to `assemble` do not necessarily do the computation at
that moment.  Rather, they return an object with a promise that the
values will be computed when you actually need to use the data.

So, in your code, you do (in pseudo-code):

with mat_timer:
   A = assemble(a, bcs=bc)

with rhs_timer:
   b = assemble(L)
   # Incidentally, you don't need to apply the boundary conditions to
   # the rhs here: if the matrix has bcs attached, the solve will take
   # care of applying them correctly.
   bc.apply(b)

with solve_timer:
   solve(A, u, b)

The first two steps complete almost immediately (probably independent
of the problem size), because they have just queued up some computation.

Now, when you come to solve the system, we know that you actually want
the matrix and right-hand side to be computed, so before handing them
to the KSP, we actually perform the computation.

So, that would explain why the "solve" time is so much larger than the
"KSPSolve" time.

It doesn't yet explain why you don't get the same kind of strong
scaling performance observed in the paper.

To explain this, we need to understand a little of what goes on when
you call assemble.

There are two compilation stages from the high-level UFL form to the
code that executes the assembly:

1. compile UFL to an element kernel;
2. compile code that executes this element kernel over the mesh.

The latter potentially invokes the C compiler.

To avoid calling the compiler too often, we remember the compiled code
on disk (so that the next time we see it, we don't compile again).
Moreover, when you assemble the same form again (in the same run), we
have already dynamically loaded the relevant code (so we don't need to
touch the disk again).

It is quite possible that what you're seeing in your timings is the
cost of touching the disk to find the code we need to execute.
Especially on a supercomputer with a shared filesystem, on many
processes this can be somewhat slower than ideal.
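
If you want to see how large this one-off cost is, one rough way
(again just a sketch; the mesh size and timers are illustrative) is to
time a cold and a warm assembly of the same form within a single run:

import time
from firedrake import *

mesh = UnitCubeMesh(20, 20, 20)
V = FunctionSpace(mesh, "CG", 3)
u, v = TrialFunction(V), TestFunction(V)
a = inner(grad(u), grad(v)) * dx

t0 = time.perf_counter()
A = assemble(a)
A.force_evaluation()          # first assembly: may pay the compile / disk-cache cost
t1 = time.perf_counter()
A = assemble(a, tensor=A)
A.force_evaluation()          # same form again: the generated code is already loaded
t2 = time.perf_counter()

print("cold assemble:", t1 - t0, "warm assemble:", t2 - t1)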

In the Firedrake paper, we measure timings without these extra costs.
The rationale for doing this is that although you might be reporting
scaling for a single Poisson solve, in all likelihood you will need to
do many such solves in your actual simulation.  Hence, ignoring the
amortized "setup" time is a reasonable thing to do.

For your code we can do this like so:

Add the additional parameter:

parameters["pyop2_options"]["lazy_evaluation"] = False

To time the matrix and rhs assembly:

# ensure warm code cache
A = assemble(a, bcs=bc)
A.force_evaluation()
b = assemble(L)

with matrix_timer:
   # If you want to reassemble into the same matrix, instead of the
   # line below you would write:
   # A = assemble(a, bcs=bc, tensor=A)
   A = assemble(a, bcs=bc)
   A.force_evaluation()

with rhs_timer:
   b = assemble(L, tensor=b)

with solve_timer:
   solve(A, u, b)


With this setup, you don't need to execute the script twice; a single
run is enough, because the in-memory warm-up does the job.
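
Putting this together, a sketch of what the timed part of poisson.py
could look like is below (the mesh size, boundary marker, solver
parameters and the use of time.perf_counter are illustrative
assumptions on my part, not taken from your script; in parallel you
would also reduce the times, e.g. take the max over ranks):

import time
from firedrake import *

parameters["pyop2_options"]["lazy_evaluation"] = False

mesh = UnitCubeMesh(80, 80, 80)
V = FunctionSpace(mesh, "CG", 3)
u, v = TrialFunction(V), TestFunction(V)
f = Constant(1.0)
a = inner(grad(u), grad(v)) * dx
L = f * v * dx
bc = DirichletBC(V, 0, "on_boundary")
uh = Function(V)

# illustrative solver parameters; substitute the settings from the paper
param = {"ksp_type": "cg", "pc_type": "hypre"}

# warm the code caches (compilation / dynamic loading happens here)
A = assemble(a, bcs=bc)
A.force_evaluation()
b = assemble(L)

t0 = time.perf_counter()
A = assemble(a, bcs=bc, tensor=A)
A.force_evaluation()
t1 = time.perf_counter()
b = assemble(L, tensor=b)
t2 = time.perf_counter()
solve(A, uh, b, solver_parameters=param)
t3 = time.perf_counter()

print("matrix:", t1 - t0, "rhs:", t2 - t1, "solve:", t3 - t2)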

Hope that clarifies things.

If you do this, I imagine that the scaling plots will make more sense
(although please do let us know if you have other issues).
Additionally, I recommend you also produce the scaling plots for the
matrix and rhs assembly (to convince yourself that they are doing the
right thing).

Cheers,

Lawrence