Position paper: Nondeterminism is unavoidable, but data races are pure evil

Hans-J. Boehm

Abstract

Modern mainstream programming languages distinguish between
“atomic” (or sometimes “volatile”) variables, and ordinary
data. Atomic accesses are treated as synchronization
constructs, and support concurrent access with well-defined
semantics. In contrast, concurrent accesses to ordinary data,
if at least one access is an update, constitute a data race.
Code with data races does not have well-defined semantics.
Such code may fail completely when recompiled or run on a
different operating system version. In C and C++,
data races are equivalent to assignments to out-of-bounds array
elements; any data race can result in arbitrary failures,
including application crashes, hangs, and inexplicably and
completely wrong answers.

These language specifications, combined with implementation
realities, make it unsafe to exploit “benign” data races
to obtain performance, even if we are willing to tolerate approximate
answers. Furthermore, even if we happen to get
lucky, and code with a data race executes correctly under our
current compiler, data races provide at best inconsequential
performance advantages over atomics. In fact, there
are interesting, and probably common, cases in which, at
sufficiently large core counts, data races provide only a minor
performance advantage even over pervasive locking to avoid
them. We demonstrate such a case.
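
To make the abstract's distinction concrete, here is a minimal C++11 sketch (an illustration of the standard semantics, not code from the paper): the first loop races on an ordinary bool and has undefined behavior; the second uses std::atomic and is well-defined.

    #include <atomic>
    #include <thread>

    bool stop_plain = false;                // ordinary data
    std::atomic<bool> stop_atomic{false};   // atomic variable

    void spin_racy() {
        // Data race: another thread writes stop_plain concurrently with
        // these unsynchronized reads. The compiler may legally hoist the
        // load out of the loop, turning this into an infinite loop; the
        // standard makes the whole program's behavior undefined.
        while (!stop_plain) { }
    }

    void spin_race_free() {
        // Well-defined: concurrent accesses to an atomic are
        // synchronization operations (sequentially consistent by default).
        while (!stop_atomic.load()) { }
    }

    int main() {
        std::thread t(spin_race_free);
        stop_atomic.store(true);            // guaranteed to become visible
        t.join();
        return 0;
    }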

Reviews

Tom Van Cutsem. Rating: 5/5 stars. Confidence: high (3/4)

The paper discusses the alleged benefits of, and the many issues with, data races, i.e. unsynchronized concurrent accesses to the same location in memory. The author points out that data races have ill-defined language semantics (for good reason), that their behavior is non-portable, and that their performance gain becomes negligible as the core count rises. Current and upcoming support for “atomics” further reduces the need for unbridled data races as a performance “tweak”.

In my opinion, this paper effectively dispels the myth that data races are a legitimate tool to boost an application’s performance (especially w.r.t. scalability). To me, the graph in Figure 1 was very revealing. In hindsight the results seem “obvious”, in the sense that even an unsynchronized memory access is competing for limited hardware resources and writes remain subject to cache coherence protocols, so that at larger core counts, this “inherent” cost starts to outweigh the “overhead” of synchronization.

That said, I’m sure this paper would generate interesting discussions at the RACES workshop. Clearly the CfP contained statements along the lines of “we might even need to embrace data races” to provoke exactly this type of discussion. The current paper would be an excellent way to bootstrap a discussion about how far we can take the “unsynchronized” approach to parallel algorithms (on mainstream hardware, using mainstream languages).

Doug Kimelman. Rating: 4/5 stars. Confidence: high (3/4)

Thank you to the author for this interesting submission. I would strongly recommend that
it be presented at the workshop — it undoubtedly would lead to spirited debate :-) There
is no substitute for dissenting opinion and well-argued challenges in refining and honing
an idea to the point where either the important essence is distilled out of it, or …
there’s nothing left :-)

The author is certainly strong in his belief :-) and unequivocal in stating it: no good
can come from permitting data races in a misguided attempt to derive benefits in terms of
performance and scalability in parallel systems.

Fundamentally, the author makes two orthogonal arguments:
1. No data race can be guaranteed to always be benign. Given current language definitions
and compiler / optimizer implementations, a data race can cause arbitrary failure at any
time.
2. (In runs where a data race happens not to cause a failure) Very little performance or
scalability is gained anyway vs. synchronization with contemporary language features and
hardware mechanisms.

Further, thank you to the author for elaborating on the finer point: there are many forms
of non-determinism other than those related to data races, and some of those are
inescapable / valuable in practice.

Summary of the paper:

No data race is reliably benign. Data races invalidate assumptions that are the basis upon
which optimizers transform code. Any data race can result in arbitrary failure, just the
same as assignment to out-of-bounds array elements in C and C++ programs. With current C
and C++ language definitions and compiler implementations, the behavior of programs with
data races is totally undefined — often referred to as “catch-fire” semantics :-) The
behavior of programs containing data races is particularly variable and unpredictable
across different and more contemporary compilers / optimizers and operating system
versions. As compilers / optimizers increasingly leverage better-defined memory models,
the odds of “favorable compilation” of data races will decline.
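
As a concrete illustration of the kind of optimizer transformation alluded to here (my example, in the spirit of the paper's argument, not code taken from it), consider register promotion of an ordinary shared variable:

    int x;   // ordinary (non-atomic) global, shared with other threads

    void add_positives(const int *a, int n) {
        // In the source, x is written only if some a[i] is positive.
        for (int i = 0; i < n; ++i)
            if (a[i] > 0)
                x += a[i];
    }

    // A compiler that assumes race-freedom may promote x to a register:
    //
    //     int tmp = x;                     // unconditional load
    //     for (int i = 0; i < n; ++i)
    //         if (a[i] > 0) tmp += a[i];
    //     x = tmp;                         // unconditional store
    //
    // The introduced store can silently overwrite a concurrent update to
    // x made by another thread, even on inputs for which the original
    // loop would never have written x at all.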

Further, permitting data races does not yield significant scalability or performance
improvements with respect to C11, C++11, and Java mechanisms for preventing races.
Compared with accesses to “atomics”, allowing data races, and hence unsynchronized access
to ordinary data, yields at best inconsequential scalability and performance advantages.
Even compared to pervasive locking, in the case of sufficiently large core counts, there
are interesting and probably common cases where there are only minor performance advantages.

Certain kinds of non-determinism are natural / inherent in multithreaded applications due
to scheduling, memory allocation, etc., and unlike data races are acceptable. Forcing
deterministic execution in these cases is impractical / unrealistic, with a significant
performance penalty, especially at higher core counts.

In experimentation with a parallel Sieve of Eratosthenes implementation, lock-based
synchronization is shown not to limit scalability. And any performance impact of locking
is overcome by scaling. Ultimately, for either lock-based synchronization or unsynchronized
access, memory bandwidth is the limiting factor. In general, locking won’t qualitatively
hurt scalability, because even without synchronization, cache coherency protocols still
effectively serialize accesses.
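
A minimal sketch of the two variants presumably being compared (my reconstruction for illustration; the paper's actual benchmark code and lock granularity are not given in this summary):

    #include <mutex>
    #include <vector>

    std::vector<char> composite;   // one byte per candidate; assumed to be
                                   // resized to n before the sieve starts
    std::mutex stripes[256];       // striped locks; the stripe count and
                                   // 64-byte regions below are assumptions

    // Unsynchronized marking: every writer stores the same value, so the
    // race on each byte looks "benign" -- but it is still a data race.
    void mark_unsync(long p, long n) {
        for (long i = p * p; i < n; i += p)
            composite[i] = 1;
    }

    // Lock-based marking: the same stores, made race-free by acquiring a
    // lock covering the byte's 64-byte region of the array.
    void mark_locked(long p, long n) {
        for (long i = p * p; i < n; i += p) {
            std::lock_guard<std::mutex> g(stripes[(i >> 6) & 255]);
            composite[i] = 1;
        }
    }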

Modern language mechanisms such as C++11 and C11 atomic objects, and Java volatile fields
and java.util.concurrent.atomic, along with choices between sequential consistency and
weaker access ordering / more relaxed consistency (e.g.
memory_order_{relaxed,release,acquire}), [will] allow data races to be prevented in many
cases with little (almost zero) performance penalty vs. unsynchronized access to ordinary
data. These mechanisms bring a substantial increase in programming complexity, but the
burden is still far less than that required to understand data race behavior.
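
For instance, here is a standard C++11 usage of the weaker orderings mentioned above (a textbook sketch, not code from the paper):

    #include <atomic>

    int payload;                    // ordinary data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                   // plain write
        ready.store(true, std::memory_order_release);   // publish
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { }
        // The acquire load synchronizes with the release store, so this
        // read of payload is race-free and guaranteed to see 42.
        int v = payload;
        (void)v;
    }

    // memory_order_relaxed, by contrast, guarantees only atomicity and
    // per-variable ordering -- cheaper still, but without the
    // publication guarantee illustrated above.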

Reviewer’s additional remarks:

The paper would be enhanced by using the sieve example from “Part 2” (concerning
performance and scalability) as a running example through “Part 1” (concerning
compilation / optimization) as well. It would be good to show “epic fail” scenarios
that could arise even for the quite benign-appearing data race on a byte
representing a sieve element.

Further, for that data race, how difficult would it be to extend the language / optimizer
/ architecture, in ways that would preclude such failures, leaving only the “intended
non-determinism”? (… leaving separate any discussion about why or whether someone would
want to have such a data race.)

Finally, *if* it were possible to have compilers / optimizers, storage hierarchies, etc.
*not* get involved (in a problematic way), are there not data races where the “intended
non-determinism” is quite simple to reason about, and certainly less challenging /
complex for a programmer than the upcoming access ordering / consistency mechanisms
discussed in the paper?

For the parallel sieve experiment, it would be good to include more of a discussion of
the details of the architecture of the system: the storage hierarchy, cache structure and
coherence protocols, and how those might factor into observed performance.

In the discussion of the sieve experiment, the conjecture is that memory bandwidth can
be exhausted by one unsynchronized thread, so adding more cores in that case does not help
/ reduce run time / achieve scaling. Further, there is little contention, which means that
synchronization does not inhibit scalability, so the synchronized version can use a few
more cores and manage to do enough of the slower synchronized operations to get to the
point where it too exhausts memory bandwidth, and hence it runs at the same speed as the
unsynchronized version. An important question: How many classes of applications exhibit
this profile: memory-bandwidth-saturated and contention-free?
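
A back-of-envelope version of that conjecture (my illustrative model, not figures from the paper): let B be the sustainable memory bandwidth, and let b_unsync and b_sync (with b_sync < b_unsync) be the bandwidth a single thread generates without and with locking. Each version stops scaling once its threads saturate memory:

    n_unsync ~ B / b_unsync    threads to saturate, unsynchronized
    n_sync   ~ B / b_sync      threads to saturate, locked

If locking slows a single thread by, say, 3x, the locked version needs roughly three times as many threads to reach the ceiling; but beyond that point both versions run at the memory-bound rate B and finish in essentially the same time.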

Even in classes of applications where there is only a “small” amount of contention, what
happens on systems with 1000 or 10,000 cores? Also, how will the various [relaxed]
consistency models fare at such scales vs. truly unsynchronized access?

Concerning the argument that locking for write operations doesn’t qualitatively hurt
scalability, because cache coherency protocols and the need to acquire exclusive access
to a cache line are effectively serializing accesses anyway: could it help in certain
scenarios (i.e. in those cases where a programmer feels that data races can be permitted)
to be able to disable cache coherence?? Again, at the scales mentioned above, wouldn’t
coherence protocols limit scalability?

Overall, although I don’t necessarily agree with the points of view expressed in this
paper, I enjoyed thinking about the issues it raises, and I would recommend it for
presentation / publication.

The writing does seem to be a bit rough / rushed, and could be improved in places to
more clearly express what I think is intended. I’d be happy to help out with that if
this paper is to be published in the ACM Digital Library proceedings.

Thank you very much.

Yvonne Coady and Liam Kiemele. Rating: 5/5 stars. Confidence: medium (2/4)

I really enjoyed this paper, and my sense is that it will succeed in sparking some interesting discussion at the workshop. It is grounded in a wealth of detail, yet transformed into a really interesting read! Key points such as the costs of avoiding data races, the impact of optimization, and the future of C11 could even fuel controversy. The Sieve example makes the paper easy to follow, and the in-depth related work provides a nice spectrum of contemporary results.

I think this paper will be a great contribution to the workshop.
