At RACES’12, we want you, the attendees, to have the opportunity to choose Continued how your time is spent. See the Call for Participation for an explanation.
Below, you will find each submission, the program committee’s reviews, and a default time allocation based on those reviews (the allocation includes both presentation and discussion time). In addition, there are additional time slots proposed for general discussion. Please vote by adjusting the sliders to indicate the allocation that you feel would maximize your return on the time you will be spending in the workshop. When the sliders are set to your satisfaction, please hit the Submit Vote button at the bottom of the page.
We plan to start with a brief introduction during which each attendee will concisely introduce him- or herself: name, institution, and one sentence about position or interest. If it garners sufficient votes, we’ll also include an introductory lightning session in which each presenter will present a strictly more about the author limited 30-second sneak preview (cf. CHI “Madness”).
If you will be attending our workshop, please help shape it to best serve you — vote today! Questions?
We present techniques for obtaining acceptable unsynchronized parallel
computations. Even though these techniques may
generate interactions (such as data races) that the current rigid value system
condemns as incorrect, they are engineered to 1) preserve key data
structure consistency constraints while 2) producing a result that is
accurate enough, often enough.
And because they contain no
synchronization, they eliminate the synchronization overhead,
parallelism reduction, and failure propagation typically associated
with synchronization mechanisms.
We also identify a class of computations that interact well with our
techniques. These computations combine multiple contributions to
obtain a composite result. Redundancy in the contributions enables the
unsynchronized computation to produce an acceptably accurate result
even though it may, in effect, drop contributions. Thus our
unsynchronized techniques (either directly or indirectly) are often
acceptably accurate for the same reason as other successful
approximate computing techniques such as task skipping and loop
Modern mainstream programming languages distinguish between
“atomic” (or sometimes “volatile”) variables, and ordinary
data. Atomic accesses are treated as synchronization
constructs, and support concurrent access with well-defined
semantics. In contrast, concurrent accesses to ordinary data,
if at least one access is an update, constitute a data race.
Code with data races does not have well-defined semantics.
Usually such code may fail completely when recompiled or
run on a different operating system version. In C and C++
data races are equivalent to assignments to out-of-bounds array
elements; any data race can result in arbitrary failures,
including application crashes, hangs, and inexplicably and
completely wrong answers.
These language specifications, combined with implementation
realities, make it unsafe to exploit “benign” data races
to obtain performance, even if we are willing to tolerate approximate
answers. Furthermore, even if we happen to get
lucky, and code with data races happens to execute correctly
with our current compiler, data races provide at best inconsequential
performance advantages over atomics. In fact, there
are interesting, and probably common, cases in which data
races provide only a minor performance advantage, even
over pervasive locking to avoid them, at sufficiently large
core counts. We demonstrate such a case.
Sasa Misailovic, Stelios Sidiroglou and Martin Rinard
We present Dubstep, a novel system that uses the
find-tranform-navigate paradigm to automatically explore new
parallelization opportunities in already parallelized
(fully-synchronized) programs by opportunistically relaxing
synchronization primitives. This set of transformations generates a
space of alternative, possibly non-deterministic, parallel programs with
varying performance and accuracy characteristics. The freedom to
generate parallel programs whose output may differ (within statistical
accuracy bounds) from the output of the original program enables a
significantly larger optimization space. Dubstep then searches this
space to find a parallel program that will, with high likelihood,
produce outputs that are acceptably close to the outputs that the
original, fully synchronized parallel program would have produced.
Initial results from our benchmarked application show that Dubstep can
generate acceptably accurate and efficient versions of a parallel
program that occupy different positions in performance/accuracy trade
Andreas Haas, Christoph Kirsch, Hannes Payer and Michael Lippautz
Designing high-performance concurrent data structures whose operation
throughput scales on multicore hardware is difficult. Concurrent
implementations of FIFO queues, for example, seem to require
algorithms that efficiently increase the potential for parallel access
by implementing semantically relaxed rather than strict FIFO queues
where elements may be returned in some out-of-order fashion. However,
we show experimentally that the on average shorter execution time of
enqueue and dequeue operations of fast but relaxed implementations may
offset the effect of semantical relaxations making them appear as
behaving more FIFO than strict but slow implementations. Our key
assumption is that ideal concurrent data structure operations should
execute in zero time. We define two metrics, element-fairness and
operation-fairness, to measure the degree of element and operation
reordering, respectively, assuming operations take zero time.
Element-fairness quantifies the deviation from FIFO queue semantics
had all operations executed in zero time. With this metric even
strict implementations of FIFO queues are not FIFO.
Operation-fairness helps explaining element-fairness by quantifying
operation reordering when considering the actual time operations took
effect relative to their invocation time. In our experiments, the
effect of poor operation-fairness of strict but slow implementations
on element-fairness may outweigh the effect of semantical relaxation
of fast but relaxed implementations.
In shared memory multiprocessors utilizing invalidation-based
coherence protocols, cache misses caused by inter-processor
communication are a dominant source of processor stall cy-
cles for many applications. We explore a novel coherence
protocol implementation called edge-chasing delayed con-
sistency (ECDC) that mitigates some of the performance
degradation caused by this class of misses. Edge-chasing
delayed consistency allows a processor to continue to non-
speculatively continue reading a cache line after receiving an
invalidation from another core, without changing the consis-
tency model oered to programmers. ECDC is not a new
programming model; it is simply a new way of implementing
memory consisting models.
The edge-chasing protocol identies those cache lines that
can safely be read after being invalidated by dynamically
constructing the constraint graph observed during
the particular execution of an application, and delaying in-
validation of the line until a read to the line would create a
cycle in that graph, and therefore be inconsistent. Because
a cache line’s useful lifetime expires only when it is strictly
necessary as dictated by the consistency model, the line can
be used signicantly longer than in conventional protocols.
While the idea of using stale data for as long as possible
is enticing, our study shows that the benefits of such delay
are small, and that the majority of these delayed invalidation
benefits come from mitigating the false sharing problem,
rather than any tolerance of races or an application’s ability
to consume stale data in a productive manner. Since false
sharing can be eectively solved a number of other ways,
through either padding in software or less complicated hard-
ware mechanisms, it is questionable whether the
advantages of the ECDC mechanism outweigh the hardware
complexity and overheads.
Programmers like to believe we live and program in a sequentially consistent world. Hardware developers have long since abandoned sequential consistency in order to develop higher performance computers. This paper advocates relaxing consistency guarantees from sequentially consistent to causally consistent. Doing so offers tremendous performance benefits while still preserving important correctness properties.
Researchers have recently begun to explore a new class of program transformations called approximate program transformations. These program transformations take an existing program and produce a new, relaxed approximate program that trades accuracy of its results for increased performance.
In this paper, we explore how developers can use relational reasoning to verify relative properties of relaxed approximate programs in that if their original program satisfies a property (such as memory safety), then the relaxed program also satisfies the property. Such relational reasoning enables developers to transfer their reasoning about the original program over to verify the relaxed.
My parallel-programming education began in earnest when I joined Sequent Computer Systems in late 1990. This education was both brief and effective: within a few short years, my co-workers and I were breaking new ground. Nor was I alone: Sequent habitually hired new-to-parallelism engineers and had them producing competent parallel code within a few months. Nevertheless, more than two decades later, parallel programming is perceived to be difficult to teach and learn. Is parallel programming an exception to the typical transitioning of technology from impossible to expert-only to routine to unworthy of conscious thought?
David Ungar, Doug Kimelman, Sam Adams and Mark Wegman
As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Lakshminarayanan Renganarayanan, Vijayalakshmi Srinivasan, Ravi Nair and Daniel Prener
We propose a methodology for programming with relaxed
synchronization. This methodology can help to reduce, and
in some cases completely eliminate, synchronization over-
head in parallel programs without sacricing the quality of
General Discussion: Morning
Time slot: 40min
General Discussion: Afternoon
Time slot: 40min
Your vote has been received.
Thank you for your participation! We are looking forward to see you at the workshop.