On Saturday, July 12, 2014, Rathgeber, Florian <f.rathgeber10@imperial.ac.uk> wrote:
On 12/07/14 08:06, David Ham wrote:
>
> Those look like interesting results.
>
> Do we have any idea why we are slow on CUDA on the RHS?

The reason is that, afaict, the kernel uses too many resources: 57
registers per thread and 28.047 KB of shared memory. We therefore get a
theoretical occupancy of 6.25%, i.e. only 1/16 of the warp slots on each
SMX of the 680 can be used. That is up to 64 DP FMAs at half the clock
speed of a Xeon core...
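
For reference, a rough back-of-the-envelope check of that number (a sketch
only: the 128-thread block size is my assumption, the per-SMX limits are the
published GK104 figures, and register allocation granularity is ignored):

# Theoretical occupancy estimate for one SMX of a GTX 680 (GK104).
REGS_PER_SMX = 65536         # 32-bit registers per SMX
SMEM_PER_SMX = 48 * 1024     # shared memory per SMX (48 KB configuration)
MAX_WARPS_PER_SMX = 64
MAX_BLOCKS_PER_SMX = 16

threads_per_block = 128      # assumed block size
regs_per_thread = 57         # as reported for the RHS kernel
smem_per_block = 28720       # bytes, ~28.047 KB as reported

blocks_by_regs = REGS_PER_SMX // (regs_per_thread * threads_per_block)  # 8
blocks_by_smem = SMEM_PER_SMX // smem_per_block                         # 1

blocks = min(blocks_by_regs, blocks_by_smem, MAX_BLOCKS_PER_SMX)  # 1, smem-bound
warps = blocks * threads_per_block // 32                          # 4
print("occupancy: %.2f%%" % (100.0 * warps / MAX_WARPS_PER_SMX))  # 6.25%

So shared memory is the binding limit: a single resident block of 4 warps
per SMX gives exactly the 6.25% figure.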


OK, that's a good analysis; make sure you give yourself time to present it. It'll make the audience realise you really know what you are doing.
 
> Do we have any indication of actual speed compared with peak flops or
> bandwidth?

I haven't been able to figure out how to drive the Nvidia profiler to
record the required metrics, but we should be able to get those somehow.
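One thing that might be worth a try (I haven't verified the metric names for
Kepler, so treat this as a guess): nvprof can list and record per-kernel
hardware counters, e.g.

    nvprof --query-metrics
    nvprof --metrics achieved_occupancy,flop_count_dp,dram_read_throughput,dram_write_throughput ./app

where ./app stands for whichever benchmark binary we run. That should give us
at least DP flop counts and DRAM throughput to compare against peak.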

Florian

> Regards,
>
> David
>
>
>
> On Friday, July 11, 2014, Rathgeber, Florian
> <f.rathgeber10@imperial.ac.uk> wrote:
>
>     I have now added performance results for advection assembly (matrix +
>     RHS). We can still claim (performance) portability to some degree across
>     sequential, OpenMP and CUDA.
>
>     On 10/07/14 11:23, David Ham wrote:
>     > I'm concerned that there are no performance results at all. Do we not
>     > even have CPU results?
>     >
>     > On Wednesday, July 9, 2014, Rathgeber, Florian
>     > <f.rathgeber10@imperial.ac.uk> wrote:
>     >
>     >     Draft slides for my 15min PDESoft talk on PyOP2 next week are at
>     >     http://kynan.github.io/pdesoft2014
>     >
>     >     Any comments and suggestions much appreciated.
>     >
>     >     Florian



--
Dr David Ham
Departments of Mathematics and Computing
Imperial College London

http://www.imperial.ac.uk/people/david.ham