Further to various discussions of reproducibility, I thought you might be interested in this paper:

http://hal.archives-ouvertes.fr/docs/00/94/93/55/PDF/superaccumulator.pdf

Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures

Abstract. On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and thus non-reproducible, mainly due to the non-associativity of floating-point operations. We introduce a solution to compute deterministic sums of floating-point numbers efficiently and with the best possible accuracy. Our multi-level algorithm consists of two main stages: a filtering stage that uses fast vectorized floating-point expansions, and an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, on the Intel Xeon Phi accelerator, and on AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude, by leveraging arithmetic units that are left underused by standard reduction algorithms.

Paul
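To make the abstract's two stages concrete, below is a minimal Python sketch of the two underlying ideas: an error-free TwoSum transformation (the basic building block of the floating-point expansions used in the filtering stage) and an exact fixed-point accumulator standing in for the high-radix carry-save superaccumulator. This is only a toy model of the idea, not the paper's vectorized implementation; the function names are made up for illustration.

    # Illustrative sketch only; not from the paper.
    from fractions import Fraction
    import random

    def two_sum(a, b):
        # Knuth's error-free transformation: s + e == a + b exactly,
        # with s the rounded sum and e the rounding error.
        s = a + b
        bp = s - a
        e = (a - (s - bp)) + (b - bp)
        return s, e

    s, e = two_sum(1e16, 1.0)   # s == 1e16, e == 1.0: the rounding error is kept

    def deterministic_sum(values):
        # Toy superaccumulator: every finite double is an integer multiple
        # of 2**-1074, so scaling by 2**1074 turns each term into an exact
        # Python integer.  Integer addition is associative, so any ordering
        # or partitioning of the loop gives bit-identical results.
        # (No special handling of inf/nan or of sums outside double range.)
        SCALE = 2 ** 1074
        acc = 0
        for x in values:
            acc += int(Fraction(x) * SCALE)   # exact conversion, no rounding
        return float(Fraction(acc, SCALE))    # one correctly rounded conversion

    # Any ordering gives the same bits:
    xs = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-30, 30)
          for _ in range(10000)]
    assert deterministic_sum(xs) == deterministic_sum(xs[::-1]) == deterministic_sum(sorted(xs))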
Hi Paul,

I seem to be missing some context for this discussion, or did I just miss an email? Or, more likely, a group meeting! Anyway, what's the story?

Cjc
On 21 Jun 2014, at 19:43, Colin Cotter wrote:

> Hi Paul, I seem to be missing some context for this discussion, or did I just miss an email? Or, more likely, a group meeting! Anyway, what's the story? Cjc

Hi Colin,

I just thought you might be interested: the paper appears to show that we should be able to get bit-accurate summations in parallel, at low cost. So, interpreting optimistically (possibly prematurely), it might be possible to make PyOP2 fully deterministic by default. At present, by contrast, the precise association of floating-point adds in PyOP2 may vary due to thread-to-thread races, or when the mesh is recoloured or repartitioned. I think the paper's claim applies to global reductions; I'm not sure it scales nicely to addtos, though there might be other solutions for that.

Paul
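A minimal illustration of the non-associativity at issue (a Python sketch assuming nothing beyond IEEE double arithmetic; the numbers are made up for the example):

    a, b, c = 1e16, 1.0, 1.0
    print((a + b) + c)   # 1e+16 (each 1.0 is absorbed by rounding)
    print(a + (b + c))   # 1.0000000000000002e+16 (grouped first, the 2.0 survives)

So a different association of the same adds, e.g. after a recolouring or a change of partition, can legitimately change the last bits of the result.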
It's interesting that it is possible. I find the quest for bit-reproducibility somewhat misguided: it is probably useful for debugging parallel codes, but if scientific users are requiring it then they are probably thinking about their problem the wrong way. To me it is similar to worrying about dynamic typing...

--cjc
Hi Paul, Colin,

I took a look at the paper and the result is certainly impressive. However, it is interesting that they chose to motivate their work with a discussion of exascale computing. A concern that several people have voiced is the difficulty of obtaining bit-reproducibility for integer computations when one goes up to exascale: at that level, undetected bit-flips are a very real possibility. Indeed, this has led to the development of fault-tolerant variants of common algorithms, e.g. http://arxiv.org/pdf/1206.1390v1.pdf.

My general view is that if you want bitwise reproducibility, you shouldn't be using floating point. (With wanting and needing being two very different things!) When people do claim bit-reproducibility for a floating-point code, it is usually for that specific binary on that specific system with that specific processor. Change any of these and the result may change. This is of questionable value in the real world (although it seems to placate those in the financial sector). If you are going to shoot for bit-reproducibility, it should be independent of parallelization, compiler choice, support library versions, and so on. However, I know of no non-trivial examples of codes claiming this.

Regards, Freddie.
On 22 Jun 2014, at 14:32, Witherden, Freddie <freddie.witherden08@imperial.ac.uk> wrote:
> Hi Paul, Colin,
>
> I took a look at the paper and the result is certainly impressive. However, it is interesting that they chose to motivate their work with a discussion of exascale computing. A concern that several people have voiced is the difficulty of obtaining bit-reproducibility for integer computations when one goes up to exascale: at that level, undetected bit-flips are a very real possibility. Indeed, this has led to the development of fault-tolerant variants of common algorithms, e.g. http://arxiv.org/pdf/1206.1390v1.pdf.
So I'm still not sold on the argument that exascale machines will have significantly more undetected bit flips than current hardware.
> My general view is that if you want bitwise reproducibility, you shouldn't be using floating point. (With wanting and needing being two very different things!) When people do claim bit-reproducibility for a floating-point code, it is usually for that specific binary on that specific system with that specific processor. Change any of these and the result may change. This is of questionable value in the real world (although it seems to placate those in the financial sector). If you are going to shoot for bit-reproducibility, it should be independent of parallelization, compiler choice, support library versions, and so on. However, I know of no non-trivial examples of codes claiming this.
FWIW, I /believe/ that the Unified Model is bit-reproducible across different parallel decompositions (I'm not sure about different compilations of the same code), and that's a very non-trivial code.

One possible scenario where you might wish for bit-reproducibility in an actual simulation run is computing adjoints of chaotic systems with large Lyapunov exponents, where you really would like your replayed forward model to be deterministic. However, I am by no means an expert; Patrick (if he's lurking) may have other opinions.

Lawrence
> FWIW, I /believe/ that the Unified Model is bit-reproducible across different parallel decompositions (I'm not sure about different compilations of the same code), and that's a very non-trivial code.
Yes, but they take a performance hit (because it limits the parallel preconditioners that they can use), and a large maintenance hit (some of the performance team seem to be entirely focussed on maintaining bit-reproducibility).
> One possible scenario where you might wish for bit-reproducibility in an actual simulation run is computing adjoints of chaotic systems with large Lyapunov exponents, where you really would like your replayed forward model to be deterministic. However, I am by no means an expert; Patrick (if he's lurking) may have other opinions.
No, bit-reproducibility doesn't save you from numerical losses when differentiating chaotic maps. --cjc
On 23/06/14 10:25, Lawrence Mitchell wrote:
> One possible scenario where you might wish for bit-reproducibility in an actual simulation run is computing adjoints of chaotic systems with large Lyapunov exponents, where you really would like your replayed forward model to be deterministic. However, I am by no means an expert; Patrick (if he's lurking) may have other opinions.
If your forward model is chaotic, your adjoint solve will blow up exponentially backwards in time. (After all, the tangent linearisation blows up forwards in time.) In that case there may be sensible derivatives of functionals, but the standard linearisation argument won't compute them, and since the whole endeavour is pointless I don't think that bit-for-bit reproducibility of the forward model is necessary or recommended.

There's been some very interesting work by Qiqi Wang from MIT on the topic of computing gradients of chaotic systems, see e.g. http://arxiv.org/abs/1204.0159.

Cheerio,
Patrick
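To make the blow-up concrete, here is a small Python sketch (an illustrative example, not taken from the thread or from Wang's paper): the tangent linearisation of the chaotic logistic map x -> 4x(1 - x), whose Lyapunov exponent is ln 2, so the sensitivity dx_n/dx_0 grows roughly like 2^n.

    # Illustrative sketch only.
    def logistic_tangent(x0, n):
        # Propagate the state x and its sensitivity dx = d x_n / d x_0.
        x, dx = x0, 1.0
        for _ in range(n):
            dx *= 4.0 - 8.0 * x      # chain rule with f'(x) = 4 - 8x
            x = 4.0 * x * (1.0 - x)
        return x, dx

    x, dx = logistic_tangent(0.2, 60)
    print(dx)   # magnitude on the order of 2**60: long-time gradients are useless

The adjoint accumulates the same product of local derivatives in reverse, so it blows up backwards in time in exactly the same way.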
participants (6)

- Colin Cotter
- Cotter, Colin J
- Kelly, Paul H J
- Lawrence Mitchell
- Patrick Farrell
- Witherden, Freddie