On 22 Jun 2014, at 14:32, Witherden, Freddie <freddie.witherden08@imperial.ac.uk> wrote:
Hi Paul, Colin,
I took a look at the paper and the result is certainly impressive. However, it is interesting that they chose to motivate their work with a discussion of exascale computing. A concern that several people have voiced is the difficulty of obtaining bit-reproducibility, even for integer computations, when one goes up to exascale: at that level undetected bit flips are a very real possibility. Indeed, this has led to the development of fault-tolerant variants of common algorithms, e.g. http://arxiv.org/pdf/1206.1390v1.pdf.
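For anyone unfamiliar with the flavour of those algorithms, the classic idea (checksum-based algorithm-based fault tolerance in the style of Huang and Abraham; I'm not claiming this is what that particular paper does) is to carry redundant data through the computation and check it afterwards. A purely illustrative Python sketch, with a hypothetical helper name:

    import numpy as np

    def checksummed_matvec(A, x):
        # Append a checksum row (the column sums of A). The last entry
        # of the augmented product should then equal the sum of the
        # real entries, up to rounding; a gross mismatch flags a fault
        # somewhere in the multiply.
        Ac = np.vstack([A, A.sum(axis=0)])
        yc = Ac @ x
        y, check = yc[:-1], yc[-1]
        if abs(y.sum() - check) > 1e-8 * max(abs(check), 1.0):
            raise RuntimeError("checksum mismatch: possible bit flip")
        return y

Note the slight irony: the detection threshold itself has to allow for floating-point rounding, precisely because the checksummed and direct sums are evaluated in different orders.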
Given that such fault-tolerant techniques exist, I'm still not sold on the argument that exascale machines will suffer significantly more undetected bit flips than current hardware.
My general view is that if you want bitwise reproducibility, you shouldn't be using floating point. (With wanting and needing being two very different things!) When people do claim bit-reproducibility for a floating-point code, it is usually for one specific binary on one specific system with one specific processor. Change any of these and the result may change. This is of questionable value in the real world (although it seems to placate those in the financial sector). If you're going to shoot for bit-reproducibility, it should be independent of parallelization, compiler choice, support library versions, and so on. However, I know of no non-trivial examples of codes claiming this.
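The root cause is that floating-point addition is not associative, so anything that changes the reduction order (a different parallel decomposition, a different vectorization choice by the compiler, a different BLAS) can change the bits of the result. A two-line Python illustration:

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0: the 1.0 is absorbed when added to -1e16

Any parallel reduction is just a regrouping of this kind at scale.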
FWIW, I /believe/ that the Unified Model is bit-reproducible across different parallel decompositions (I'm not sure about different compilations of the same code), and that's a very non-trivial code.

One scenario where you might genuinely wish for bit-reproducibility in a simulation run is computing adjoints of chaotic systems with large Lyapunov exponents, where you really would like your replayed forward model to be deterministic. However, I am by no means an expert; Patrick (if he's lurking) may have other opinions.

Lawrence
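P.S. To make the chaos point concrete, a toy Python sketch using the logistic map at r = 4 (Lyapunov exponent ln 2): a perturbation of a few ulps in the initial condition grows to an order-one difference within about sixty iterations, so a non-deterministic forward replay leaves the adjoint linearized about the wrong trajectory.

    # Two trajectories of the chaotic logistic map, initially 2**-52
    # apart, roughly double their separation each step and reach O(1)
    # within ~60 iterations.
    x, y = 0.3, 0.3 + 2**-52
    for _ in range(60):
        x, y = 4.0*x*(1.0 - x), 4.0*y*(1.0 - y)
    print(abs(x - y))  # order-one difference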