Re: [firedrake] cached kernels
Hi Lawrence (copied to firedrake, since overheads from loading libraries might be a general concern),

I tried it on ARCHER and adding caching for the kernels does not make any difference. The LU solve performance at lowest order is poor, but an individual call actually takes more time (~0.01s) than the operator application (~0.001s), so I would have thought the overheads are relatively smaller for the LU solve. For the operator application the reported BW is excellent, but for the LU solve it is very poor. At higher order both BWs are good; here the data volume is larger, but the time for one LU solve call is still ~0.01s. Maybe in this case any overhead that shows up at lowest order is hidden.

Could there be an overhead from loading the LAPACK library, which is required for the LU solve? The operator application does not require LAPACK, so this would be consistent with the observations. Are all libraries statically or dynamically linked? I thought I shouldn't see an overhead, since PyOP2 loads the libraries in the warmup run, but maybe that doesn't cover anything that is dynamically linked? At higher order, where there is more work, this would then be hidden.

Looking at the total time per iteration, the matrix-free time (which is dominated by the LU solve) is definitely worse than the PETSc time.

Thanks,
Eike
On 5 Nov 2015, at 19:44, Eike Mueller <E.Mueller@bath.ac.uk> wrote:
Hi Lawrence,
I added some code for caching the kernels in the BandedMatrix class, see the “cache_kernels” branch, which I pushed to origin. Basically, I wrapped every op2.Kernel(…) call in BandedMatrix in the class method _cached_kernel(). This method checks whether the kernel has already been added to a dictionary and, if so, retrieves it; otherwise it constructs the kernel and stores it. However, even if I *don’t* do this and simply construct the kernels with
kernel = op2.Kernel(…)
and then print out their id with
print hex(id(kernel))
I get the same number in all calls. This looks like the kernel has already been cached, but maybe it is simply because id() can return the same value for objects with non-overlapping lifetimes? However, if I ensure that the kernel object is not destroyed (for example by appending it to a list of kernel objects), I still observe the same behaviour: all entries in the list have the same id.
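For concreteness, the wrapper does roughly the following (a sketch only, not the exact code in the branch; keying the cache on the (code, name) pair is just one possible choice):

from pyop2 import op2

class BandedMatrix(object):
    # Class-level cache shared by all instances: (code, name) -> op2.Kernel
    _kernel_cache = {}

    @classmethod
    def _cached_kernel(cls, code, name):
        """Return a cached op2.Kernel, constructing and storing it on first use."""
        key = (code, name)
        if key not in cls._kernel_cache:
            cls._kernel_cache[key] = op2.Kernel(code, name)
        return cls._kernel_cache[key]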
Thanks,
Eike
On 08/11/15 10:06, Eike Mueller wrote:
Hi Lawrence (copied to firedrake, since overheads from loading libraries might be a general concern),
I tried it on ARCHER and adding caching for the kernels does not make any difference. The LU solve performance at lowest order is poor, but an individual call actually takes more time (~0.01s) than the operator application (~0.001s), so I would have thought the overheads are relatively smaller for the LU solve. For the operator application the reported BW is excellent, but for the LU solve it is very poor. At higher order both BWs are good; here the data volume is larger, but the time for one LU solve call is still ~0.01s. Maybe in this case any overhead that shows up at lowest order is hidden.
Could there be an overhead from loading the LAPACK library, which is required for the LU solve?
This isn't how dynamic loading works. The first time you load the .so, in the warmup phase, the symbol is resolved, and the trampoline is replaced by a direct call.

I have effectively no idea what's going on. Does the LU solve take this long on this much data if you just call it from C? IOW, I think it's not "our" fault, unless somehow you're managing to get a recompile or similar every time you call _lu_solve.

Lawrence
Hi Lawrence,
Could there be an overhead from loading the LAPACK library, which is required for the LU solve?
This isn't how dynamic loading works. The first time you load the .so, in the warmup phase, the symbol is resolved, and the trampoline is replaced by a direct call.
Yes, that's what I also thought, just wanted to double check.
I have effectively no idea what's going on. Does the LU solve take this long on this much data if you just call it from C?
I agree, that seems to be the obvious test to do. Maybe there is something strange going on in LAPACK. I will have a go at that and see if I get exactly the same time as when I run it through firedrake.
IOW, I think it's not "our" fault, unless somehow you're managing to get a recompile or similar every time you call _lu_solve.
Lawrence
Thanks,
Eike
Hi Lawrence,
I have effectively no idea what's going on. Does the LU solve take this long on this much data if you just call it from C?
I just carried out exactly that experiment (see the C-code in firedrake-multigridpaper/code/test_lusolve). Basically in each vertical column I do an LU solve with exactly the same matrix size and bandwidth as in the firedrake code. I use the same horizontal grid size as for the firedrake code and also use the compiler/linker flags which are printed out at the end of the run by PETSc, so I'm confident that the C-code does exactly the same as the code autogenerated by firedrake (only difference is that I initialise the matrix with random values, but make it diagonally dominant to avoid excessive pivoting).

Interestingly I can reproduce the problem in the pure C-code, so either there is an issue with the LAPACK/BLAS on ARCHER or the matrices are simply too small to get good performance (for example because the indirect addressing in the horizontal prevents efficient reuse of higher-level caches). More specifically I get the following bandwidths (calculated by working out the data volume that is streamed through for every LU solve and dividing this by the measured runtime):

*** lowest order *** (81920 vertical columns, matrix size = 64x64, matrix bandwidth=3, 24 cores on ARCHER) Measured memory bandwidth = 0.530GB/s (per core), 12.722GB/s (per node)

*** higher order *** (5120 vertical columns, matrix size = 384x384, matrix bandwidth=55, 24 cores on ARCHER) Measured memory bandwidth = 3.601GB/s (per core), 86.434GB/s (per node)

So the higher order case is running at bandwidth peak, but the lowest order case is far away from it. That implies that the problem lies with the BLAS/LAPACK implementation, not hidden firedrake overheads.

Probably the way forward is to follow this up with ARCHER support; I can now give them a well-defined test case which reproduces the problem. Any other ideas?

Thanks,
Eike
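(For reference, a rough Python analogue of the per-column banded solve and of the bandwidth accounting. This is a sketch of the methodology only, not the actual C test: kl = ku = 1 for "bandwidth = 3" and the streamed-data model are assumptions, and the per-call Python overhead means the absolute numbers are not comparable to the C timings.)

import time
import numpy as np
from scipy.linalg import solve_banded

# Lowest-order parameters quoted above; kl = ku = 1 (tridiagonal) is assumed
# for "matrix bandwidth = 3".
ncols, n, kl, ku = 81920, 64, 1, 1

# One banded matrix (LAPACK band storage) and right-hand side per column,
# random but diagonally dominant to avoid excessive pivoting.
ab = np.random.rand(ncols, kl + ku + 1, n)
ab[:, ku, :] += 2.0 * (kl + ku)      # row ku holds the main diagonal
rhs = np.random.rand(ncols, n)

t0 = time.time()
for i in range(ncols):
    solve_banded((kl, ku), ab[i], rhs[i])
elapsed = time.time() - t0

# Assumed data-volume model: the band is read once, the right-hand side is
# read and the solution written once, all as 8-byte doubles.
nbytes = ncols * n * 8.0 * (kl + ku + 1 + 2)
print("estimated bandwidth: %.3f GB/s" % (nbytes / elapsed / 1.0e9))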
On 12/11/15 11:39, Eike Mueller wrote:
Hi Lawrence,
I have effectively no idea what's going on. Does the LU solve take this long on this much data if you just call it from C?
I just carried out exactly that experiment (see the C-code in firedrake-multigridpaper/code/test_lusolve). Basically in each vertical column I do an LU solve with exactly the same matrix size and bandwidth as in the firedrake code. I use the same horizontal grid size as for the firedrake code and also use the compiler/linker flags which are printed out at the end of the run by PETSc, so I’m confident that the C-code does exactly the same as the code autogenerated by firedrake (only difference is that I initialise the matrix with random values, but make it diagonally dominant to avoid excessive pivoting).
Interestingly I can reproduce the problem in the pure C-code, so either there is an issue with the LAPACK/BLAS on ARCHER or the matrices are simply too small to get good performance (for example because the indirect addressing in the horizontal prevents efficient reuse of higher-level caches). More specifically I get the following bandwidths (calculated by working out the data volume that is streamed through for every LU solve and dividing this by the measured runtime):
*** lowest order *** (81920 vertical columns, matrix size = 64x64, matrix bandwidth=3, 24 cores on ARCHER) Measured memory bandwidth = 0.530GB/s (per core), 12.722GB/s (per node)
OK, the matrix here is small, so plausibly there's nowhere for LAPACK to do good things, and the overhead of calling it, rather than just inlining a simple algorithm, is hurting.
*** higher order *** (5120 vertical columns, matrix size = 384x384, matrix bandwidth=55, 24 cores on ARCHER) Measured memory bandwidth = 3.601GB/s (per core), 86.434GB/s (per node)
Good, the matrix here is pretty big, so LAPACK does a good job.
so the higher order case is running at bandwidth peak, but the lowest order case is far away from it.
That implies that the problem lies with the BLAS/LAPACK implementation, not hidden firedrake overheads.
Probably the way forward is to follow this up with ARCHER support; I can now give them a well-defined test case which reproduces the problem.
Hopefully they will be able to suggest something.
Any other ideas?
At lowest order the system really is just tridiagonal, right? Can one just drop in an "inlined" tridiagonal solve for this case?

Cheers,
Lawrence
Hi Lawrence,

OK, problem solved. If I use the in-place Thomas algorithm for the lowest order tridiagonal system instead of LAPACK's LU solver routines, I get excellent memory throughput (average 3.4GB/s per core, so about peak for the full node). The time per iteration drops significantly from 0.44s to 0.24s (compared to 0.35s for the PETSc solver with hypre preconditioner), so this was really a change worth implementing!

Thanks,
Eike
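For reference, the per-column recurrences of the in-place Thomas algorithm look roughly like this (a Python sketch only; the actual implementation lives in the generated C kernel, and no pivoting is performed, which is fine for the diagonally dominant matrices that arise here):

import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system in place.

    a: sub-diagonal (a[0] unused), b: main diagonal, c: super-diagonal
    (c[-1] unused), d: right-hand side, overwritten with the solution.
    Note that b is also overwritten during the forward sweep.
    """
    n = len(b)
    for i in range(1, n):                    # forward elimination
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    d[n - 1] /= b[n - 1]                     # back substitution
    for i in range(n - 2, -1, -1):
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i]
    return d

# Quick check against a dense solve on one diagonally dominant column
n = 64
a, c, d = np.random.rand(n), np.random.rand(n), np.random.rand(n)
b = 4.0 + np.random.rand(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
x = thomas_solve(a.copy(), b.copy(), c.copy(), d.copy())
assert np.allclose(A.dot(x), d)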
On 12 Nov 2015, at 11:51, Lawrence Mitchell <lawrence.mitchell@imperial.ac.uk> wrote:
On 12/11/15 11:39, Eike Mueller wrote:
Hi Lawrence,
I have effectively no idea what's going on. Does the LU solve take this long on this much data if you just call it from C?
I just carried out exactly that experiment (see the C-code in firedrake-multigridpaper/code/test_lusolve). Basically in each vertical column I do an LU solve with exactly the same matrix size and bandwidth as in the firedrake code. I use the same horizontal grid size as for the firedrake code and also use the compiler/linker flags which are printed out at the end of the run by PETSc, so I’m confident that the C-code does exactly the same as the code autogenerated by firedrake (only difference is that I initialise the matrix with random values, but make it diagonally dominant to avoid excessive pivoting).
Interestingly I can reproduce the problem in the pure C-code, so either there is an issue with the LAPACK/BLAS on ARCHER or the matrices are simply too small to get good performance (for example because the indirect addressing in the horizontal prevents efficient reuse of higher-level caches). More specifically I get the following bandwidths (calculated by working out the data volume that is streamed through for every LU solve and dividing this by the measured runtime):
*** lowest order *** (81920 vertical columns, matrix size = 64x64, matrix bandwidth=3, 24 cores on ARCHER) Measured memory bandwidth = 0.530GB/s (per core), 12.722GB/s (per node)
OK, the matrix here is small, so plausibly there's nowhere for LAPACK to do good things, and the overhead of calling it, rather than just inlining a simple algorithm, is hurting.
*** higher order *** (5120 vertical columns, matrix size = 384x384, matrix bandwidth=55, 24 cores on ARCHER) Measured memory bandwidth = 3.601GB/s (per core), 86.434GB/s (per node)
Good, the matrix here is pretty big, so LAPACK does a good job.
so the higher order case is running at bandwidth peak, but the lowest order case is far away from it.
That implies that the problem lies with the BLAS/LAPACK implementation, not hidden firedrake overheads.
Probably the way forward is to follow this up with ARCHER support; I can now give them a well-defined test case which reproduces the problem.
Hopefully they will be able to suggest something.
Any other ideas?
At lowest order the system really is just tridiagonal, right? Can one just drop in an "inlined" tridiagonal solve for this case?
Cheers,
Lawrence
On 12/11/15 14:45, Eike Mueller wrote:
Hi Lawrence,
OK, problem solved. If I use the in-place Thomas algorithm for the lowest order tridiagonal system instead of LAPACK's LU solver routines, I get excellent memory throughput (average 3.4GB/s per core, so about peak for the full node).
The time per iteration drops significantly from 0.44s to 0.24s (compared to 0.35s for the PETSc solver with hypre preconditioner), so this was really a change worth implementing!
Ah, nice! Thanks, Lawrence
participants (2)
- Eike Mueller
- Lawrence Mitchell