Further to various discussions of reproducibility, I thought you might be interested in this paper:

http://hal.archives-ouvertes.fr/docs/00/94/93/55/PDF/superaccumulator.pdf

Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures

Abstract. On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and thus non-reproducible, mainly due to the non-associativity of floating-point operations. We introduce a solution to compute deterministic sums of floating-point numbers efficiently and with the best possible accuracy. Our multi-level algorithm consists of two main stages: a filtering stage that uses fast vectorized floating-point expansions, and an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, on the Intel Xeon Phi accelerator, and on AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude, by leveraging arithmetic units that are left underused by standard reduction algorithms.

Paul
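To make the abstract's two stages concrete, below is a minimal Python sketch of the two underlying ideas: an error-free TwoSum transformation (the basic building block of the floating-point expansions used in the filtering stage) and an exact fixed-point accumulator standing in for the high-radix carry-save superaccumulator. This is only a toy model of the idea, not the paper's vectorized implementation; the function names are made up for illustration.

    # Illustrative sketch only; not from the paper.
    from fractions import Fraction
    import random

    def two_sum(a, b):
        # Knuth's error-free transformation: s + e == a + b exactly,
        # with s the rounded sum and e the rounding error.
        s = a + b
        bp = s - a
        e = (a - (s - bp)) + (b - bp)
        return s, e

    s, e = two_sum(1e16, 1.0)   # s == 1e16, e == 1.0: the rounding error is kept

    def deterministic_sum(values):
        # Toy superaccumulator: every finite double is an integer multiple
        # of 2**-1074, so scaling by 2**1074 turns each term into an exact
        # Python integer.  Integer addition is associative, so any ordering
        # or partitioning of the loop gives bit-identical results.
        # (No special handling of inf/nan or of sums outside double range.)
        SCALE = 2 ** 1074
        acc = 0
        for x in values:
            acc += int(Fraction(x) * SCALE)   # exact conversion, no rounding
        return float(Fraction(acc, SCALE))    # one correctly rounded conversion

    # Any ordering gives the same bits:
    xs = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-30, 30)
          for _ in range(10000)]
    assert deterministic_sum(xs) == deterministic_sum(xs[::-1]) == deterministic_sum(sorted(xs))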
Hi Paul,

I seem to be missing some context for this discussion, or did I just miss an email? Or, more likely, a group meeting! Anyway, what's the story?

Cjc
On 21 Jun 2014, at 19:43, Colin Cotter wrote:

> Hi Paul, I seem to be missing some context for this discussion, or did I just miss an email? Or, more likely, a group meeting! Anyway, what's the story? Cjc

Hi Colin,

I just thought you might be interested: the paper appears to show that we should be able to get bit-accurate summations in parallel, at low cost. So, interpreting optimistically (possibly prematurely), it might be possible to make PyOP2 fully deterministic by default. At present, by contrast, the precise association of floating-point adds in PyOP2 may vary due to thread-to-thread races, or when the mesh is recoloured or repartitioned. I think the paper's claim applies to global reductions; I'm not sure it scales nicely to addtos, though there might be other solutions for that.

Paul
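A minimal illustration of the non-associativity at issue (a Python sketch assuming nothing beyond IEEE double arithmetic; the numbers are made up for the example):

    a, b, c = 1e16, 1.0, 1.0
    print((a + b) + c)   # 1e+16 (each 1.0 is absorbed by rounding)
    print(a + (b + c))   # 1.0000000000000002e+16 (grouped first, the 2.0 survives)

So a different association of the same adds, e.g. after a recolouring or a change of partition, can legitimately change the last bits of the result.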
It's interesting that it is possible. I find the quest for bit-reproducibility somewhat misguided: it is probably useful for debugging parallel codes, but if scientific users are requiring it then they are probably thinking about their problem the wrong way. To me it is similar to worrying about dynamic typing...

--cjc
Hi Paul, Colin,

I took a look at the paper and the result is certainly impressive. However, it is interesting that they chose to motivate their work with a discussion of exascale computing. A concern that several people have voiced is the difficulty of obtaining bit-reproducibility for integer computations when one goes up to exascale: at that level, undetected bit-flips are a very real possibility. Indeed, this has led to the development of fault-tolerant variants of common algorithms, e.g. http://arxiv.org/pdf/1206.1390v1.pdf.

My general view is that if you want bitwise reproducibility, you shouldn't be using floating point. (With wanting and needing being two very different things!) When people do claim bit-reproducibility for a floating-point code, it is usually for that specific binary on that specific system with that specific processor. Change any of these and the result may change. This is of questionable value in the real world (although it seems to placate those in the financial sector). If you are going to shoot for bit-reproducibility, it should be independent of parallelization, compiler choice, support library versions, and so on. However, I know of no non-trivial examples of codes claiming this.

Regards, Freddie.
On 22 Jun 2014, at 14:32, Witherden, Freddie <freddie.witherden08@imperial.ac.uk> wrote:
> Hi Paul, Colin,
>
> I took a look at the paper and the result is certainly impressive. However, it is interesting that they chose to motivate their work with a discussion of exascale computing. A concern that several people have voiced is the difficulty of obtaining bit-reproducibility for integer computations when one goes up to exascale: at that level, undetected bit-flips are a very real possibility. Indeed, this has led to the development of fault-tolerant variants of common algorithms, e.g. http://arxiv.org/pdf/1206.1390v1.pdf.
So I'm still not sold on the argument that exascale machines will have significantly more undetected bit flips than current hardware.
> My general view is that if you want bitwise reproducibility, you shouldn't be using floating point. (With wanting and needing being two very different things!) When people do claim bit-reproducibility for a floating-point code, it is usually for that specific binary on that specific system with that specific processor. Change any of these and the result may change. This is of questionable value in the real world (although it seems to placate those in the financial sector). If you are going to shoot for bit-reproducibility, it should be independent of parallelization, compiler choice, support library versions, and so on. However, I know of no non-trivial examples of codes claiming this.
FWIW, I /believe/ that the Unified Model is bit-reproducible across different parallel decompositions (I'm not sure about different compilations of the same code), and that's a very non-trivial code.

One possible scenario where you might wish for bit-reproducibility in an actual simulation run is computing adjoints of chaotic systems with large Lyapunov exponents, where you really would like your replayed forward model to be deterministic. However, I am by no means an expert; Patrick (if he's lurking) may have other opinions.

Lawrence
> FWIW, I /believe/ that the Unified Model is bit-reproducible across different parallel decompositions (I'm not sure about different compilations of the same code), and that's a very non-trivial code.
Yes, but they take a performance hit (because it limits the parallel preconditioners that they can use), and a large maintenance hit (some of the performance team seem to be entirely focussed on maintaining bit-reproducibility).
> One possible scenario where you might wish for bit-reproducibility in an actual simulation run is computing adjoints of chaotic systems with large Lyapunov exponents, where you really would like your replayed forward model to be deterministic. However, I am by no means an expert; Patrick (if he's lurking) may have other opinions.
No, bit-reproducibility doesn't save you from numerical losses when differentiating chaotic maps. --cjc
On 23/06/14 10:25, Lawrence Mitchell wrote:
> One possible scenario where you might wish for bit-reproducibility in an actual simulation run is computing adjoints of chaotic systems with large Lyapunov exponents, where you really would like your replayed forward model to be deterministic. However, I am by no means an expert; Patrick (if he's lurking) may have other opinions.
If your forward model is chaotic, your adjoint solve will blow up exponentially backwards in time. (After all, the tangent linearisation blows up forwards in time.) In that case there may be sensible derivatives of functionals, but the standard linearisation argument won't compute them, and since the whole endeavour is pointless I don't think that bit-for-bit reproducibility of the forward model is necessary or recommended.

There's been some very interesting work by Qiqi Wang from MIT on the topic of computing gradients of chaotic systems, see e.g. http://arxiv.org/abs/1204.0159.

Cheerio,
Patrick
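To make the blow-up concrete, here is a small Python sketch (an illustrative example, not taken from the thread or from Wang's paper): the tangent linearisation of the chaotic logistic map x -> 4x(1 - x), whose Lyapunov exponent is ln 2, so the sensitivity dx_n/dx_0 grows roughly like 2^n.

    # Illustrative sketch only.
    def logistic_tangent(x0, n):
        # Propagate the state x and its sensitivity dx = d x_n / d x_0.
        x, dx = x0, 1.0
        for _ in range(n):
            dx *= 4.0 - 8.0 * x      # chain rule with f'(x) = 4 - 8x
            x = 4.0 * x * (1.0 - x)
        return x, dx

    x, dx = logistic_tangent(0.2, 60)
    print(dx)   # magnitude on the order of 2**60: long-time gradients are useless

The adjoint accumulates the same product of local derivatives in reverse, so it blows up backwards in time in exactly the same way.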
participants (6)

- Colin Cotter
- Cotter, Colin J
- Kelly, Paul H J
- Lawrence Mitchell
- Patrick Farrell
- Witherden, Freddie