Hi Florian,
On 08/08/2014 02:53 AM, Florian Rathgeber wrote:
Hi Tuomas,
On 07/08/14 21:10, Tuomas Karna wrote:
Hi Lawrence,
Thanks for these, this is great. On Stampede I can access the compilers in the devel queue, so this works like a charm.
Would you have any hints for improving parallel scalability? Right now things start to level out at 8 cores (2D shallow water with P2-P1 elements, 76k triangles).
Are you strong scaling 2D shallow water on a 76k-triangle mesh? If so, you get quite a small number of DOFs on 8 cores. Can you say which part isn't scaling? The assembly? The solve?
Is your test case accessible somewhere?
Yes, it will level out at some point, but I was expecting a bit better strong scaling. It seems that both the SNES solver and the ParLoop compute/kernel are taking most of the time and scaling equally poorly. Unfortunately the test scenario is not online at the moment.
For further benchmarks, you can have a look at: https://github.com/firedrakeproject/firedrake-bench
Also, is there another tool for timings besides PYOP2_PRINT_SUMMARY?
We're still working on improving the profiling capabilities built into PyOP2/Firedrake. There are a few things you can do already [1]:
1) Define your own timers with timed_region or timed_function
from pyop2.profiling import timed_region, timed_function
with timed_region("my code"):
    # my code

@timed_function("my function")
def my_func():
    ...
2) Access the timers programmatically:
from pyop2 import timing
timing("my timer")               # get total time
timing("my timer", total=False)  # get average time per call

Thanks, these are very useful.
Cheers, Tuomas
Cheers, Florian
[1]: https://github.com/OP2/PyOP2/wiki/%5Bguide%5D-Profiling#using-pyop2s-interna...
Cheers,
Tuomas
Hi Tuomas,
Comments inline below.
On 7 Aug 2014, at 07:31, Tuomas Karna <tuomas.karna at gmail.com> wrote:
> Hi all,
>
> I'm running my code on TACC Stampede and I have a couple of questions:
>
> How can I set up Firedrake/PyOP2 for the target machine, for example to use the Intel compilers, the target architecture (Sandy Bridge) instruction set, or the Intel MKL libraries?

Note that for all this to work, you need to be able to launch the compiler on the compute nodes (in particular for Intel, the compute nodes will have to be able to talk to the license server), but I hope this all works.
Selecting the Intel compiler is done like this:
from firedrake import *
parameters["coffee"]["compiler"] = "intel"
This still launches 'mpicc' but uses Intel-specific flags when compiling. If the compiler is actually called something different (e.g. on Cray systems it's always cc), then you need to do:
export CC=name_of_cc_compiler
# only do this if the linker is separate from the compiler
export LDSHARED=name_of_linker
and then:
from firedrake import *
parameters["coffee"]["compiler"] = "intel"
In addition, you probably want to select the AVX instruction set:
parameters["coffee"]["simd_isa"] = "avx"
Finally, you can select various code transformations to be applied to the generated kernels to increase their performance (see http://arxiv.org/abs/1407.0904 for many details). In particular you may wish to try:
# loop invariant code motion
parameters["coffee"]["licm"] = True
# Data alignment (to give compilers a better chance at vectorising)
parameters["coffee"]["ap"] = True
# Request the compiler try harder to auto-vectorise
from pyop2.coffee.ast_plan import AUTOVECT
parameters["coffee"]["vect"] = (AUTOVECT, -1)
Our experience is that for most forms, the big gains come from loop invariant code motion and data alignment, but Fabio can comment in much more detail on this.
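To put the above together, here is a minimal, untested sketch of a script that sets all of these options at once; the mesh, function space and Poisson-style form are just placeholders for your own problem:

from firedrake import *
from pyop2.coffee.ast_plan import AUTOVECT

# Compiler and instruction set
parameters["coffee"]["compiler"] = "intel"
parameters["coffee"]["simd_isa"] = "avx"

# Kernel transformations: loop-invariant code motion, data alignment,
# and a hint to the compiler to auto-vectorise
parameters["coffee"]["licm"] = True
parameters["coffee"]["ap"] = True
parameters["coffee"]["vect"] = (AUTOVECT, -1)

# Placeholder problem: assemble a stiffness matrix on a small mesh
mesh = UnitSquareMesh(100, 100)
V = FunctionSpace(mesh, "CG", 2)
u = TrialFunction(V)
v = TestFunction(V)
A = assemble(dot(grad(u), grad(v)) * dx)

Since the pay-off of the vectorisation options depends on the form and polynomial degree, it's worth timing your own problem with and without them.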
Finally, on MKL usage: by default, none of the code Firedrake/PyOP2 generates uses BLAS at all, so if you want to exploit BLAS in the solver library, you just need to compile PETSc appropriately. You can also apply code transformations to the kernels to convert them into BLAS calls, but for low order it may not be much of a win (since we do not amortise the call overhead across elements). You can try it, though:
# BLAS transformations aren't exposed through the parameters dict, so:
from firedrake import *
op2.configuration["blas"] = "mkl"
...
You can also, if it's available, try Eigen instead:
op2.configuration["blas"] = "eigen"
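By way of illustration, here is a small, untested sketch of what that might look like in context; it assumes MKL is actually available to your PyOP2 installation, and the degree-4 mass matrix is only there because, as noted above, low-order forms are unlikely to benefit:

from firedrake import *

# Ask PyOP2 to turn element kernels into BLAS calls (MKL assumed available;
# use "eigen" instead if that is what you have installed)
op2.configuration["blas"] = "mkl"

mesh = UnitSquareMesh(50, 50)
V = FunctionSpace(mesh, "CG", 4)   # higher order: more work per element
u = TrialFunction(V)
v = TestFunction(V)
M = assemble(u * v * dx)           # assembly exercises the transformed kernels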
> Until now I've only used MPI. What do I need to do to run with OpenMP or CUDA, for instance? I've got the PyOP2 dependencies in place, as they are listed on the Firedrake website.

Running with OpenMP requires that you've built PETSc with "--with-threadcomm --with-openmp --with-pthreadclasses". You can then either do:
export PYOP2_BACKEND=openmp  # Or however your batch system wants you to do this.
export OMP_NUM_THREADS=...
mpiexec ... python script.py
or:
from firedrake import *
op2.init(backend='openmp')
...
Note that our current experience is that the OpenMP backend does not offer many (if any) performance improvements over plain MPI mode. However, if you're solving big elliptic problems using AMG, you may wish to use it due to decreased memory pressure.
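As a concrete (and again untested) sketch of the kind of setup where that might apply, here's a largish elliptic solve with an AMG preconditioner; the mesh size and the cg/gamg solver options are illustrative choices of mine, not a recommendation:

from firedrake import *
op2.init(backend='openmp')  # assumes OMP_NUM_THREADS is set in the environment

mesh = UnitSquareMesh(512, 512)
V = FunctionSpace(mesh, "CG", 1)
u = TrialFunction(V)
v = TestFunction(V)
f = Constant(1.0)

# Helmholtz-type problem (symmetric positive definite, no boundary conditions needed)
a = dot(grad(u), grad(v)) * dx + u * v * dx
L = f * v * dx

sol = Function(V)
solve(a == L, sol,
      solver_parameters={'ksp_type': 'cg', 'pc_type': 'gamg'})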
CUDA only works for a subset of firedrake functionality. In particular, the following currently don't work:
- Solving mixed systems
- Nonlinear solves
- Extruded meshes
- Solving linear systems with CUDA + MPI
Furthermore, for moderately complicated forms, our current strategy for parallelisation on GPUs is far from optimal (to the point where the compiler can often fail). Work is starting in this area, but it's only in the early stages.
Florian has been doing some recent benchmarking in this area, so perhaps he will comment in more detail.
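If you want to experiment within those constraints anyway, backend selection presumably works analogously to OpenMP. The following is only an untested sketch: the backend name 'cuda' and the assembled-matrix solve path are assumptions on my part, and it should be run on a single MPI rank with a plain (non-mixed, non-extruded) space:

from firedrake import *
op2.init(backend='cuda')  # or, presumably, export PYOP2_BACKEND=cuda

mesh = UnitSquareMesh(64, 64)
V = FunctionSpace(mesh, "CG", 1)
u = TrialFunction(V)
v = TestFunction(V)
f = Constant(1.0)

# Assemble the system explicitly and do a plain linear solve
A = assemble(dot(grad(u), grad(v)) * dx + u * v * dx)
b = assemble(f * v * dx)

sol = Function(V)
solve(A, sol, b, solver_parameters={'ksp_type': 'cg', 'pc_type': 'jacobi'})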
> Also, does anyone have experience of using Intel MIC coprocessors with Firedrake?

I believe Fabio has done some work in this area, but I'm not sure if he did it running the full toolchain; perhaps he can comment.
Cheers,
Lawrence