Re: [firedrake] Firedrake on supercomputers
Hi Lawrence, Thanks for these, this is great. On Stampede I can access the compilers in the devel queue, so this works like a charm. Would you have any hints for improving parallel scalability? Right now things start to level out at 8 cores (2D shallow water with P2-P1 elements, 76k triangles). Also, is there a tool for timings other than PYOP2_PRINT_SUMMARY? Cheers, Tuomas
Hi Tuomas,
comments in line below.
On 7 Aug 2014, at 07:31, Tuomas Karna <tuomas.karna at gmail.com> wrote:
Hi all,

I'm running my code on TACC Stampede and I have a couple of questions:

How can I set up Firedrake/PyOP2 for the target machine, for example to use Intel compilers, the target architecture (Sandy Bridge) instruction set, or Intel MKL libraries?

Note that for all this to work, you need to be able to launch the compiler on the compute nodes (in particular for Intel, the compute nodes will have to be able to talk to the license server), but I hope this all works.
Selection of the intel compiler is:
from firedrake import *
parameters["coffee"]["compiler"] = "intel"
This still launches 'mpicc' but uses Intel-specific flags when compiling. If the compiler is actually called something different (e.g. on Cray systems it's always cc), then you need to do:
export CC=name_of_cc_compiler
# only do this if the linker is separate from the compiler
export LDSHARED=name_of_linker
and then:
from firedrake import *
parameters["coffee"]["compiler"] = "intel"
In addition, you probably want to select the AVX instruction set:
parameters["coffee"]["simd_isa"] = "avx"
Finally, you can select various code transformations to be applied to the generated kernels to increase their performance (see http://arxiv.org/abs/1407.0904 for many details). In particular, you may wish to try:
# loop-invariant code motion
parameters["coffee"]["licm"] = True
# Data alignment (to give compilers a better chance at vectorising)
parameters["coffee"]["ap"] = True
# Request the compiler try harder to auto-vectorise
from pyop2.coffee.ast_plan import AUTOVECT
parameters["coffee"]["vect"] = (AUTOVECT, -1)
Our experience is that for most forms, the big gains come from loop invariant code motion and data alignment, but Fabio can comment in much more detail on this.
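Putting those options together, a minimal script preamble for a Sandy Bridge machine might look like the following sketch (it only collects the settings already shown above; which transformations actually pay off is form- and machine-dependent):

from firedrake import *

# COFFEE code-generation options collected from the discussion above.
parameters["coffee"]["compiler"] = "intel"   # Intel-specific compile flags
parameters["coffee"]["simd_isa"] = "avx"     # Sandy Bridge supports AVX
parameters["coffee"]["licm"] = True          # loop-invariant code motion
parameters["coffee"]["ap"] = True            # align/pad data for vectorisation

# ... mesh, function spaces and forms as usual ...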
Finally, regarding MKL: by default, none of the code Firedrake/PyOP2 generates uses BLAS at all, so if you want to exploit BLAS in the solver library you just need to compile PETSc appropriately. You can apply code transformations to the kernels to convert them into BLAS calls, but for low order it may not be much of a win (since we do not amortise the call overhead across elements). You can try it, though:
# BLAS transformations aren't exposed through the parameters dict, so:
from firedrake import *
op2.configuration["blas"] = "mkl"
...
You can also, if it's available, try Eigen instead:
op2.configuration["blas"] = "eigen"
Until now I've only used MPI. What do I need to do to run with OpenMP or CUDA, for instance? I've got the PyOP2 dependencies in place, as they are listed on the Firedrake website.

Running with OpenMP requires that you've built PETSc with "--with-threadcomm --with-openmp --with-pthreadclasses". You can then either do:
export PYOP2_BACKEND=openmp
# Or however your batch system wants you to do this.
export OMP_NUM_THREADS=...
mpiexec ... python script.py
or:
from firedrake import *
op2.init(backend='openmp')
...
Note that our current experience is that the OpenMP backend does not offer many (if any) performance improvements over just MPI mode; however, if you're solving big elliptic problems using AMG, you may wish to use it due to decreased memory pressure.
CUDA only works for a subset of firedrake functionality. In particular, the following currently don't work:
- Solving mixed systems
- Nonlinear solves
- Extruded meshes
- Solving linear systems with CUDA + MPI
Furthermore, for moderately complicated forms, our current strategy for parallelisation on GPUs is far from optimal (to the point where the compiler can often fail). Work is starting in this area, but it's only in its early stages.
Florian has been doing some recent benchmarking in this area, so perhaps he will comment in more detail.
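If you want to experiment with the CUDA backend on the supported subset, selection mirrors the OpenMP case above; a sketch (the backend name 'cuda' here is assumed by analogy with 'openmp', so check the PyOP2 documentation):

from firedrake import *

# Assumed backend name, by analogy with the OpenMP example above; must be
# selected before any parallel loops run.
op2.init(backend='cuda')

# ... then set up and solve one of the supported problem types
# (no mixed systems, nonlinear solves, extruded meshes, or CUDA + MPI).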
Also, does anyone have experience of using Intel MIC coprocessors with Firedrake?

I believe Fabio has done some work in this area, but I'm not sure if he did it running the full toolchain; perhaps he can comment.
Cheers,
Lawrence
On 7 Aug 2014, at 21:10, Tuomas Karna <tuomas.karna@gmail.com> wrote:
Hi Lawrence,
Thanks for these, this is great. On Stampede I can access the compilers in the devel queue, so this works like a charm.
Would you have any hints for improving parallel scalability? Right now things start to level out at 8 cores (2D shallow water with P2-P1 elements, 76k triangles).
There are a few things that can go wrong here:

1. Poor load balancing leading to very variable assembly times per process, but if you have a reasonable mesh that's unlikely to be a problem.

2. Choosing a solver/preconditioner combination that takes longer to converge in parallel than in serial (e.g. block-ILU). You can run with solver_parameters={'ksp_monitor': True} to see how many iterations you're taking (see the sketch after this list).

3. We're aware of performance issues in our mesh decomposition strategy when running in parallel (but your mesh is quite small, so I would not expect that to be a massive problem). Additionally, if you're running a large number of time steps, that is just a setup cost that should be amortised.

4. Running out of strong scaling. With 76k triangles you've got around 300k velocity degrees of freedom and 40k pressure dofs. A reasonable rule of thumb is that for implicit solvers you can strong scale to around 10k dofs per process (assuming the code is generally efficient), so I'd expect you to be able to see scalability a little beyond 8 processes. But see below for some potential advice on how to squeeze some more strong scaling out.
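For point 2, a minimal, self-contained sketch of passing PETSc options through solve to watch the iteration counts (a small Poisson problem stands in for the real system; the names and forms here are illustrative, not from your code):

from firedrake import *

# Stand-in problem: a small Poisson solve just to illustrate passing
# solver_parameters; swap in your own function spaces and forms.
mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "CG", 1)
u = TrialFunction(V)
v = TestFunction(V)
f = Function(V)
f.assign(1.0)
a = inner(grad(u), grad(v))*dx
L = f*v*dx
uh = Function(V)
bc = DirichletBC(V, 0.0, (1, 2, 3, 4))

# 'ksp_monitor' prints the residual at every Krylov iteration, so you can
# compare iteration counts between serial and parallel runs.
solve(a == L, uh, bcs=bc,
      solver_parameters={'ksp_monitor': True})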
Also, is there a tool for timings other than PYOP2_PRINT_SUMMARY?
From our side, the timings summary we print is the only data available in the master branches.

You can see PETSc's view of the world by running with the -log_summary command line option:

mpiexec -n ... python foo.py -log_summary

This will give you an indication of whether things like the solve itself are scaling.

Running in serial, you can run with the Python profiler to see which parts of the code are taking time:

python -m cProfile -o profile.stats foo.py

You can then get a nice call graph, with percentage times in the different routines underneath main, by installing gprof2dot and running:

gprof2dot -f pstats profile.stats | xdot

Florian has been doing a reasonable amount of benchmarking recently, so he may be able to give more detailed advice on how to proceed.

Finally, a little more advice on amortising Python overhead, which is key for good strong scaling. We do a number of things to try and minimise the overhead involved in having a high-level symbolic representation, but you can write code in a few ways that will defeat (at least parts of) this setup.

If you have a time step loop, you should define all the forms (as much as possible) outside the loop. We cache the kernel that FFC generates, but computing the cache key can be somewhat expensive, so we also cache the kernel on the particular form object; if that changes, then we go down the more expensive route.

Furthermore, if you call the top-level "solve" routine inside a time stepping loop:

while t < T:
    solve(a == L, ...)

this builds and then tears down the solver and preconditioner every time step, as well as re-assembling the linear operator. Although we have a cache that notices that you're reassembling the same operator and just returns something from the cache, you still pay some overhead (and the price of setup and destruction of the solver object, which can be substantial, especially if the preconditioner is heavyweight).

To avoid this, you can instead build a solver object outside the time stepping loop and then just call the solve method on it inside:

problem = LinearVariationalProblem(a, L, u, bcs=..., ...)
solver = LinearVariationalSolver(problem, solver_parameters={...})

while t < T:
    solver.solve()

Note that in this latter case, when you're finally done with the solver you must call:

solver.destroy()

otherwise it will not be garbage collected.

Hope this helps.

Cheers,
Lawrence
Lawrence, On 08/08/2014 02:34 AM, Lawrence Mitchell wrote:
On 7 Aug 2014, at 21:10, Tuomas Karna <tuomas.karna@gmail.com> wrote:
Hi Lawrence,
Thanks for these, this is great. On Stampede I can access the compilers in the devel queue, so this works like a charm.
Would you have any hints for improving parallel scalability? Right now things start to level out at 8 cores (2D shallow water with P2-P1 elements, 76k triangles).
There are a few things that can go wrong here:
1. Poor load balancing leading to very variable assembly times per process, but if you have a reasonable mesh that's unlikely to be a problem.
2. Choosing a solver/preconditioner combination that takes longer to converge in parallel than in serial (e.g. block-ILU). You can run with solver_parameters={'ksp_monitor': True} to see how many iterations you're taking.
3. We're aware of performance issues in our mesh decomposition strategy when running in parallel (but your mesh is quite small, so I would not expect that to be a massive problem). Additionally, if you're running a large number of time steps, that is just a setup cost that should be amortised.
4. Running out of strong scaling. With 76k triangles you've got around 300k velocity degrees of freedom and 40k pressure dofs. A reasonable rule of thumb is that for implicit solvers you can strong scale to around 10k dofs per process (assuming the code is generally efficient), so I'd expect you to be able to see scalability a little beyond 8 processes. But see below for some potential advice on how to squeeze some more strong scaling out.

Load balancing may be an issue on this grid, haven't looked at that yet. I'm also seeing performance degradation over time in parallel, so I suspect there's a bug somewhere but haven't dug deep enough to find it.
Also, is there a tool for timings other than PYOP2_PRINT_SUMMARY?

From our side, the timings summary we print is the only data available in the master branches.
You can see PETSc's view of the world by running with the -log_summary command line option:
mpiexec -n ... python foo.py -log_summary
This will give you an indication of whether things like the solve itself are scaling.
Running in serial, you can run with the python profiler to see which parts of the code are taking time:
python -m cProfile -o profile.stats foo.py
You can then get a nice call graph, with percentage times in the different routines underneath main, by installing gprof2dot and running:
gprof2dot -f pstats profile.stats | xdot
Florian has been doing a reasonable amount of benchmarking recently, so he may be able to give more detailed advice on how to proceed.

Finally, a little more advice on amortising Python overhead, which is key for good strong scaling.
We do a number of things to try and minimise the overhead involved in having a high-level symbolic representation, but you can write code in a few ways that will defeat (at least parts of) this setup.
If you have a time step loop, you should define all the forms (as much as possible) outside the loop. We cache the kernel that FFC generates, but computing the cache key can be somewhat expensive, so we also cache the kernel on the particular form object; if that changes, then we go down the more expensive route. A sketch of this pattern follows.
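A short sketch of the difference (the space, forms and time-step values below are placeholders, not taken from your shallow-water code):

from firedrake import *

mesh = UnitSquareMesh(16, 16)
V = FunctionSpace(mesh, "CG", 1)
v = TestFunction(V)
un = Function(V)   # previous-step solution (placeholder)
f = Function(V)
f.assign(1.0)
dt, t, T = 0.01, 0.0, 0.1

# Good: the form is built once, outside the loop, so every assemble() finds
# the cached kernel via this single form object.
L_rhs = (un + dt*f)*v*dx
while t < T:
    b = assemble(L_rhs)
    # ... solve with b, update un ...
    t += dt

# Slower: constructing the form inside the loop makes a new form object each
# step, so the cheap object-keyed cache misses and the more expensive cache
# key must be recomputed every iteration.
t = 0.0
while t < T:
    b = assemble((un + dt*f)*v*dx)
    t += dt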
Furthermore, if you call the top-level "solve" routine inside a time stepping loop:
while t < T:
    solve(a == L, ...)
This builds and then tears down the solver and preconditioner every time step, as well as re-assembling the linear operator. Although we have a cache that notices that you're reassembling the same operator and just returns something from the cache, you still pay some overhead (and the price of setup and destruction of the solver object, which can be substantial especially if the preconditioner is heavyweight).
To avoid this, you can instead build a solver object outside the time stepping loop and then just call the solve method on it inside:
problem = LinearVariationalProblem(a, L, u, bcs=..., ...)
solver = LinearVariationalSolver(problem, solver_parameters={...})
while t < T:
    solver.solve()
Note that in this latter case, when you're finally done with the solver you must call:
solver.destroy()
otherwise it will not be garbage collected.

Good info. I had all the forms declared outside the time loop, but was using the high-level solve routine. That made a difference on run times.
Cheers, Tuomas
Hope this helps.
Cheers,
Lawrence
Dear Tuomas, a somewhat late reply. On 11/08/14 22:49, Tuomas Karna wrote: ...
Load balancing may be an issue on this grid, haven't looked at that yet. I'm also seeing performance degradation over time in parallel, so I suspect there's a bug somewhere but haven't dug deep enough to find it.
Eventually, we found this bug: there was a problem in the way we were preallocating matrices in parallel when running non-linear problems that had zero initial conditions, resulting in performance degradation when assembling Jacobians. If you get a chance to run with the current firedrake [352295c] and PyOP2 [adb4114] master branches, you should hopefully see that this has gone away.

Cheers,
Lawrence
Hi Tuomas, On 07/08/14 21:10, Tuomas Karna wrote:
Hi Lawrence,
Thanks for these, this is great. On Stampede I can access the compilers in the devel queue, so this works like a charm.
Would you have any hints for improving parallel scalability? Right now things start to level out at 8 cores (2D shallow water with P2-P1 elements, 76k triangles).
Are you strong scaling 2d shallow water on a 76k triangle mesh? If so, you get quite a small number of DOFs on 8 cores. Can you say which part isn't scaling? The assembly? The solve? Is your test case accessible somewhere? For further benchmarks, you can have a look at: https://github.com/firedrakeproject/firedrake-bench
Also, is there a tool for timings other than PYOP2_PRINT_SUMMARY?
We're still working on improving the profiling capabilities built into PyOP2/Firedrake. There's a few things you can do already [1]:

1) Define your own timers with timed_region or timed_function:

from pyop2.profiling import timed_region, timed_function

with timed_region("my code"):
    # my code

@timed_function("my function")
def my_func():
    ...

2) Access the timers programmatically:

from pyop2 import timing
timing("my timer")               # get total time
timing("my timer", total=False)  # get average time per call

Cheers,
Florian

[1]: https://github.com/OP2/PyOP2/wiki/%5Bguide%5D-Profiling#using-pyop2s-interna...
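As a small usage sketch, you could wrap your whole time loop in one of these regions and read the total back at the end (the mesh, loop body and region name below are placeholders):

from firedrake import *
from pyop2.profiling import timed_region
from pyop2 import timing

mesh = UnitSquareMesh(16, 16)
V = FunctionSpace(mesh, "CG", 1)
u = Function(V)
t, T, dt = 0.0, 0.1, 0.01

# Accumulate the cost of the whole loop under one named timer...
with timed_region("time loop"):
    while t < T:
        assemble(u*u*dx)   # stand-in for the real per-step work
        t += dt

# ...and query it programmatically afterwards.
print(timing("time loop"))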
Hi Florian, On 08/08/2014 02:53 AM, Florian Rathgeber wrote:
Hi Tuomas,
On 07/08/14 21:10, Tuomas Karna wrote:
Hi Lawrence,
Thanks for these, this is great. On Stampede I can access the compilers in the devel queue, so this works like a charm.
Would you have any hints for improving parallel scalability? Right now things start to level out at 8 cores (2D shallow water with P2-P1 elements, 76k triangles).

Are you strong scaling 2d shallow water on a 76k triangle mesh? If so, you get quite a small number of DOFs on 8 cores. Can you say which part isn't scaling? The assembly? The solve?
Is your test case accessible somewhere?

Yes, it will level out at some point, but I was expecting a bit better strong scaling. It seems that both the SNES solver and the ParLoop compute/kernel are taking most of the time and scaling equally poorly. Unfortunately the test scenario is not online at the moment.

For further benchmarks, you can have a look at: https://github.com/firedrakeproject/firedrake-bench
Also, is there a tool for timings other than PYOP2_PRINT_SUMMARY?

We're still working on improving the profiling capabilities built into PyOP2/Firedrake. There's a few things you can do already [1]:
1) Define your own timers with timed_region or timed_function
from pyop2.profiling import timed_region, timed_function
with timed_region("my code"):
    # my code
@timed_function("my function")
def my_func():
    ...
2) Access the timers programmatically:
from pyop2 import timing
timing("my timer")               # get total time
timing("my timer", total=False)  # get average time per call

Thanks, these are very useful.
Cheers, Tuomas
Cheers, Florian
[1]: https://github.com/OP2/PyOP2/wiki/%5Bguide%5D-Profiling#using-pyop2s-interna...
participants (3): Florian Rathgeber, Lawrence Mitchell, Tuomas Karna