On Saturday, July 12, 2014, Rathgeber, Florian <f.rathgeber10@imperial.ac.uk> wrote:
On 12/07/14 08:06, David Ham wrote:
>
> Those look like interesting results.
>
> Do we have any idea why we are slow on CUDA on the RHS?

The reason is that, afaict, the kernel uses too many resources: 57
registers per thread and 28.047 KB of shared memory. We therefore get a
theoretical occupancy of 6.25%, i.e. only 1/16 of the warp slots on each
SMX of the 680 can be used. That is up to 64 DP FMAs at half the clock
speed of a Xeon core...
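
For reference, a rough back-of-the-envelope check of that number (a sketch
only: the 128-thread block size is my assumption, the per-SMX limits are the
published GK104 figures, and register allocation granularity is ignored):

# Theoretical occupancy estimate for one SMX of a GTX 680 (GK104).
REGS_PER_SMX = 65536         # 32-bit registers per SMX
SMEM_PER_SMX = 48 * 1024     # shared memory per SMX (48 KB configuration)
MAX_WARPS_PER_SMX = 64
MAX_BLOCKS_PER_SMX = 16

threads_per_block = 128      # assumed block size
regs_per_thread = 57         # as reported for the RHS kernel
smem_per_block = 28720       # bytes, ~28.047 KB as reported

blocks_by_regs = REGS_PER_SMX // (regs_per_thread * threads_per_block)  # 8
blocks_by_smem = SMEM_PER_SMX // smem_per_block                         # 1

blocks = min(blocks_by_regs, blocks_by_smem, MAX_BLOCKS_PER_SMX)  # 1, smem-bound
warps = blocks * threads_per_block // 32                          # 4
print("occupancy: %.2f%%" % (100.0 * warps / MAX_WARPS_PER_SMX))  # 6.25%

So shared memory is the binding limit: a single resident block of 4 warps
per SMX gives exactly the 6.25% figure.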


OK, that's a good analysis; make sure you give yourself time to present it. It'll make the audience realise you really know what you are doing.
 
> Do we have any indication of actual speed compared with peak flops or
> bandwidth?

I haven't been able to figure out how to drive the Nvidia profiler to
record the required metrics, but we should be able to get those somehow.
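One thing that might be worth a try (I haven't verified the metric names for
Kepler, so treat this as a guess): nvprof can list and record per-kernel
hardware counters, e.g.

    nvprof --query-metrics
    nvprof --metrics achieved_occupancy,flop_count_dp,dram_read_throughput,dram_write_throughput ./app

where ./app stands for whichever benchmark binary we run. That should give us
at least DP flop counts and DRAM throughput to compare against peak.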

Florian

> Regards,
>
> David
>
>
>
> On Friday, July 11, 2014, Rathgeber, Florian
> <f.rathgeber10@imperial.ac.uk> wrote:
>
>     I have now added performance results for advection assembly (matrix +
>     RHS). We can still claim (performance) portability to some degree across
>     sequential, OpenMP and CUDA.
>
>     On 10/07/14 11:23, David Ham wrote:
>     > I'm concerned that there are no performance results at all. Do we not
>     > even have CPU results?
>     >
>     > On Wednesday, July 9, 2014, Rathgeber, Florian
>     > <f.rathgeber10@imperial.ac.uk> wrote:
>     >
>     >     Draft slides for my 15min PDESoft talk on PyOP2 next week are at
>     >     http://kynan.github.io/pdesoft2014
>     >
>     >     Any comments and suggestions much appreciated.
>     >
>     >     Florian



--
Dr David Ham
Departments of Mathematics and Computing
Imperial College London

http://www.imperial.ac.uk/people/david.ham