Hi Paul, I seem to be missing some context for this discussion, or did I just miss an email? Or, more likely, a group meeting! Anyway, what's the story?
Cjc
Further to various discussions of reproducibility, I thought you might be interested in this paper:
Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures
Abstract. On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and thus non-reproducible mainly due to non-associativity of floating-point operations. We introduce a solution to compute deterministic sums of floating-point numbers efficiently and with the best possible accuracy. Our multi-level algorithm consists of two main stages: a filtering stage that uses fast vectorized floating-point expansions; an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, on Intel Xeon Phi accelerator, and on AMD and NVIDIA GPUs. We show that the numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.
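For anyone who wants to see the non-associativity problem in isolation, here is a minimal Python sketch. It is not the paper's algorithm: it uses the standard library's math.fsum, which computes a correctly rounded sum on typical IEEE 754 platforms, as a stand-in for an order-independent summation. The helper name naive_sum and the sample values are made up for illustration.

```python
import itertools
import math


def naive_sum(values):
    """Left-to-right accumulation, as a simple serial reduction would do."""
    total = 0.0
    for v in values:
        total += v
    return total


# Three terms are enough to show that grouping changes the result.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c   # 1.0: the large terms cancel first
right = a + (b + c)  # 0.0: the 1.0 is absorbed by -1e16 before cancelling

# Illustrative values only (an assumption, not data from the paper).
values = [1e16, 1.0, -1e16, 1e-3]

# Sum every ordering of the same values, mimicking the way a parallel
# reduction may combine partial results in a different order on each run.
naive_results = {naive_sum(p) for p in itertools.permutations(values)}
exact_results = {math.fsum(p) for p in itertools.permutations(values)}

print("grouping:", left, right)
print("distinct naive results:", len(naive_results))
print("distinct fsum results:", len(exact_results))
```

The naive loop gives different answers depending on the order of the inputs, while the correctly rounded sum does not, because the exact sum is independent of ordering. The paper's contribution, as I read the abstract, is getting that property at full speed on parallel hardware.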
Paul