On 7 Aug 2014, at 21:10, Tuomas Karna <tuomas.karna@gmail.com> wrote:
Hi Lawrence,
Thanks for these, this is great. On stampede I can access the compilers in the devel queue so this works like a charm.
Would you have any hints for improving parallel scalability? Right now things start to level out at 8 cores (2D shallow water with P2-P1 elements, 76k triangles).
There are a few things that can go wrong here:

1. Poor load balancing, leading to very variable assembly times per process. If you have a reasonable mesh, that's unlikely to be a problem.

2. Choosing a solver/preconditioner combination that takes longer to converge in parallel than in serial (e.g. block-ILU). You can run with solver_parameters={'ksp_monitor': True} to see how many iterations you're taking.

3. Known performance issues in our mesh decomposition strategy when running in parallel (your mesh is quite small, so I would not expect that to be a massive problem). Additionally, if you're running a large number of time steps, that is just a setup cost that should be amortised.

4. Running out of strong scaling. With 76k triangles you've got around 300k velocity degrees of freedom and 40k pressure dofs. A reasonable rule of thumb is that implicit solvers strong-scale down to around 10k dofs per process (assuming the code is generally efficient), so I'd expect you to be able to scale somewhat beyond 8 processes. See below for some advice on how to squeeze more strong scaling out.
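To make point 4 concrete, here is the back-of-envelope arithmetic (the 10k-dofs-per-process figure is only a rule of thumb, not a measured threshold):

```python
# Rough strong-scaling estimate for the P2-P1 problem above.
velocity_dofs = 300000   # approx. P2 vector dofs on 76k triangles
pressure_dofs = 40000    # approx. P1 dofs
total_dofs = velocity_dofs + pressure_dofs

# Rule-of-thumb floor for implicit solvers: ~10k dofs per process.
dofs_per_process = 10000

max_processes = total_dofs // dofs_per_process
print(max_processes)  # -> 34
```

So on this mesh you would expect strong scaling to run out somewhere in the low tens of processes, well beyond 8.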
Also, is there another tool for timings other than PYOP2_PRINT_SUMMARY?
From our side, the timing summary we print is the only data available in the master branches. You can see PETSc's view of the world by running with the -log_summary command line option:

    mpiexec -n ... python foo.py -log_summary

This will give you an indication of whether things like the solve itself are scaling.

Running in serial, you can use the Python profiler to see which parts of the code are taking time:

    python -m cProfile -o profile.stats foo.py

You can then get a nice call graph, with percentage times in the routines underneath main, by installing gprof2dot and running:

    gprof2dot -f pstats profile.stats | xdot

Florian has been doing a reasonable amount of benchmarking recently, so he may be able to give more detailed advice on how to proceed.

Finally, a little more advice on amortising Python overhead, which is key for good strong scaling. We do a number of things to try to minimise the overhead of having a high-level symbolic representation, but code written in certain ways will defeat (at least parts of) this setup. If you have a time step loop, you should define all the forms (as far as possible) outside the loop. We cache the kernel that FFC generates, but computing the cache key can be somewhat expensive, so we also cache the kernel on the particular form /object/; if that changes, we go down the more expensive route.

Furthermore, if you call the top-level "solve" routine inside a time stepping loop:

    while t < T:
        solve(a == L, ...)

this builds and then tears down the solver and preconditioner every time step, as well as re-assembling the linear operator. Although we have a cache that notices you're reassembling the same operator and just returns something from the cache, you still pay some overhead (plus the price of setting up and destroying the solver object, which can be substantial, especially if the preconditioner is heavyweight).
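A toy sketch (no Firedrake here; ToySolver is just a hypothetical stand-in for a solver whose setup is expensive) shows why this matters:

```python
import time

SETUPS = 0  # count how many times we pay the setup cost

class ToySolver:
    """Stand-in for a solver whose setup (e.g. preconditioner) is expensive."""
    def __init__(self):
        global SETUPS
        SETUPS += 1
        time.sleep(0.005)   # pretend this is the expensive setup

    def solve(self):
        pass                # the per-step work

nsteps = 5

# Pattern 1: rebuilding the solver inside the loop pays setup every step.
SETUPS = 0
for _ in range(nsteps):
    ToySolver().solve()
print("setups when rebuilding each step:", SETUPS)  # -> 5

# Pattern 2: build once, reuse inside the loop; setup is paid once.
SETUPS = 0
solver = ToySolver()
for _ in range(nsteps):
    solver.solve()
print("setups when reusing the solver:", SETUPS)    # -> 1
```

The second pattern is exactly what the solver-object interface below lets you do.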
To avoid this, you can instead build a solver /object/ outside the time stepping loop and then just call its solve method inside:

    problem = LinearVariationalProblem(a, L, u, bcs=..., ...)
    solver = LinearVariationalSolver(problem, solver_parameters={...})
    while t < T:
        solver.solve()

Note that in this latter case, when you're finally done with the solver you must call:

    solver.destroy()

otherwise it will not be garbage collected.

Hope this helps.

Cheers,

Lawrence