Hello, 
I'm trying to reproduce the Firedrake strong-scaling experiments reported in 
this paper, but I'm not seeing similar scaling trends in my experiments.
Here are some more details:
Machine:
I'm on TACC's Stampede2, running experiments on the Skylake nodes. Each node has 2 sockets, with 24 physical cores per socket.
PETSc:
The Firedrake fork of PETSc was compiled with the 2017 Intel compilers, using -O3 and the Skylake-specific -xCORE-AVX512 flag.
Problem:
The problem is the Poisson equation in 3D. The preconditioner and linear solver are the same as in the paper. At each core count, the problem was run twice and only the second run was timed.
Results:
I'm attaching a strong scaling plot. It was produced with a mesh of 80x80x80 cubes and 3rd-order continuous polynomials as the finite element function space, which gives ~14M DOFs in total.
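As a sanity check on the ~14M figure, here is a quick calculation. It assumes a scalar unknown and a continuous Lagrange space of degree p on an n x n x n grid of cells, which has n*p + 1 nodes along each axis; the function name is just for illustration and is not part of Firedrake.

```python
def cg_dofs_on_cube(cells_per_side, degree):
    """Node count of a continuous degree-`degree` Lagrange space on a
    cells_per_side^3 structured cube mesh (scalar unknown assumed)."""
    nodes_per_axis = cells_per_side * degree + 1
    return nodes_per_axis ** 3


# 80 cells per side, degree 3 -> 241 nodes per axis
print(cg_dofs_on_cube(80, 3))  # prints 13997521
```

So at 48 cores per node, each added node changes the DOFs per core noticeably, which is worth keeping in mind when reading the plot.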
The plot shows the maximum time (over all ranks) spent solving the pre-assembled linear system, as measured by wrapping MPI timers around firedrake.solve(A, u, solver_parameters=param). Also reported is the time spent in the KSPSolve event. While the KSPSolve event scales well, the call to firedrake.solve doesn't. It seems that some operation occurring inside firedrake.solve, before ksp.solve is reached, is causing the bottleneck.
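A minimal sketch of this kind of measurement is below; the attached script is the authoritative version. It assumes mpi4py is available when a communicator is passed, and the helper name max_elapsed is illustrative only. Without a communicator it falls back to a plain serial timer so it can be tried on its own.

```python
import time


def max_elapsed(fn, comm=None):
    """Time fn() and return the slowest rank's wall-clock time in seconds.

    comm is an optional mpi4py communicator (assumption: mpi4py is installed
    whenever comm is given). With comm=None this is a serial timer.
    """
    if comm is None:
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    from mpi4py import MPI  # imported lazily so the serial path needs no MPI

    comm.Barrier()                 # start all ranks together
    start = MPI.Wtime()
    fn()
    elapsed = MPI.Wtime() - start
    # Every rank receives the maximum elapsed time across the communicator.
    return comm.allreduce(elapsed, op=MPI.MAX)


# Serial demo with a dummy workload standing in for the solve call.
demo_time = max_elapsed(lambda: sum(range(100000)))
```

In a parallel run, fn would wrap the solve call and comm would be the communicator the mesh lives on.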
I'm also attaching the Python script I used to solve the Poisson equation, and sample output from PETSc's -log_view option.
Any hints as to why this isn't scaling would be appreciated!