Designing high-performance concurrent data structures whose operation
throughput scales on multicore hardware is difficult. Concurrent
implementations of FIFO queues, for example, seem to require
algorithms that efficiently increase the potential for parallel access
by implementing semantically relaxed rather than strict FIFO queues
where elements may be returned in some out-of-order fashion. However,
we show experimentally that the on average shorter execution time of
enqueue and dequeue operations of fast but relaxed implementations may
offset the effect of semantical relaxations making them appear as
behaving more FIFO than strict but slow implementations. Our key
assumption is that ideal concurrent data structure operations should
execute in zero time. We define two metrics, element-fairness and
operation-fairness, to measure the degree of element and operation
reordering, respectively, assuming operations take zero time.
Element-fairness quantifies the deviation from FIFO queue semantics
had all operations executed in zero time. With this metric even
strict implementations of FIFO queues are not FIFO.
Operation-fairness helps explaining element-fairness by quantifying
operation reordering when considering the actual time operations took
effect relative to their invocation time. In our experiments, the
effect of poor operation-fairness of strict but slow implementations
on element-fairness may outweigh the effect of semantical relaxation
of fast but relaxed implementations.
This is an important critique of linearizability as a correctness criterion, showing how a linearizable sequence of operations can nevertheless appear to violate intuitive concepts of ordering. Of course, any parallel application that pushes all of its data through a single queue should be redesigned if performance or scalability is a primary goal, but such designs are nevertheless quite common, so it is valuable to explore their consequences.
From the abstract, it is not clear that operation fairness and element fairness are different. From section 2.1, element fairness is a measure of apparent misordering, the number of elements that overtake the element in question. From section 2.2, operation fairness is similar, but based on operations rather than elements. Isn’t the sum of element fairness over all elements necessarily equal to the sum of operation fairness over all operations for a given history? Or is the difference that (for example) an enqueue operation can overtake a dequeue operation, but an element being enqueued cannot possibly overtake an operation being dequeued? Either way, it would be very helpful to readers if you were to call out any differences between the two measures.
In Figure 4a vs. 4b, why is 4b not similar to the first 24 threads worth of 4a? Different latency properties of the two hardware systems? Also, the producer-consumer benchmark has multiple producers and consumers for a given queue, with half the threads being producers and the other half being consumers?
It is important to note that caller-visible misordering can also happen before and after the individual operations, in addition to within the individual operations. All the care and expense of maintaining strict FIFO ordering is lost immediately after the dequeue completes, and similarly a strict FIFO enqueue operation does nothing to guarantee anything about ordering of operations preceding the enqueue. But from the user’s perspective, there is no difference between misordering that occurs immediately preceding/following an operation and misordering within that operation. This line of reasoning should cause one to question the value of parallel strict FIFO queues, regardless of environment or workload.
Haas et al. assess the FIFOness of concurrent queues by assuming zero
execution time for all operations. They introduce the metrics element-fairness
and operation-fairness for this assessment and show in their argumentation and
experiments that queues with relaxed semantics can have properties that could
be considered more FIFO than queues with strict semantics.
I like the presented idea because it gives us a way to quantify properties
such as FIFOness in concurrent systems. It would be very interesting to apply
it to Hendler’s elimination stack  as well as, to assess it’s LIFO
properties. These or similar metrics would be useful characteristics for
concurrent data structures in addition to the asymptotic complexity of their
operations, and the operations’ concurrent scalability properties. It would
help developers to chose the appropriate data structure for the task at hand.
However, I wonder how generalizable/applicable this approach is given the
observations of Sec. 3.3, where the authors observe that the measured fairness
depends significantly on the hardware, and potentially on the synchronization
operations of the surrounding code.
For the experiments, I wonder how the logging is done. What kind of approach
is used, how is the data recorded, and how does this instrumentation influence
the measured results? Furthermore, how significant are the
fluctuations/measurement errors between different runs?
I would not start to discussed the details of the approach in the introduction
already. Instead, I would discuss the high-level context and concentrate on
motivation and problem as well as contributions.
Sec. 3: I assume the array for the Flat Combining FIFO queue is padded to
account for cache lines?
 Hendler, D.; Shavit, N. & Yerushalmi, L. (2004), A Scalable Lock-free
Stack Algorithm, in ‘Proceedings of the sixteenth annual ACM symposium on
Parallelism in algorithms and architectures’, pp. 206-215.
The RACES’12 workshop explores the premise that synchronization is the enemy of scalability.
This paper goes further, arguing that synchronization is the enemy of determinism,
making the case that (some) FIFO queue implementations that loosen ordering constraints actually deliver more deterministic ordering.
I found this conclusion to be both plausible and wonderfully paradoxical.
For example, the paper compares the Michael-Scott queue, a strict-FIFO queue in which each thread using compare-and-swap on the queue,
with a Scal queue, a relaxed-FIFO queue consisting of multiple partial queues, and a load-balancer.
If one assumes that the intended order of, say, the enqueue operations follows the order of making the enqueue requests,
then (the paper shows) the relaxed-FIFO queue actually results in a better-ordered result than the strict-FIFO queue.
This surprising result occurs because contention for the atomically-accessed words of the strict-FIFO queue cause the
operations to retry, creating varying delays between the start of the enqueue operations and the points in time when they succeed and thus, take effect.
The paper shows a pleasingly paradoxical result that is right in line with the aim of the workshop, and I believe it will be a strong contribution.
Its result likely generalizes, as well.
I do have a few suggestions:
There were one or two places where extra figures might add clarity, instead of using one figure to present several concepts.
I would move a lot of the formal treatment to an Appendix, since I found the informal portions clear enough, and thought that the formal explanations of the last five or so of six symmetric metrics were distracting. In other words, I suggest inline formal treatment of one of the metrics, with appeals to symmetry for the rest, and appendices for the details.
The paper deals well with the effect of contention, but I would have liked a few general sentences in the conclusion about the general role that
the degree of contention plays when designing parallel data structures. When choosing an implementation of a data structure (FIFO queue here),
the expected degree of contention for that shared structure is a key factor, and this paper adds a dimension to those tradeoffs.
The metrics in the paper concern ordering. It would be interesting for future work to also measure time. In other words, if two operations
are wrongly-ordered, it would be interesting to measure by how much time they were wrongly-ordered.
On page 2: Paragraph 1 in the sequential history definition contains: “No other operations appear in s.” That sentence is confusing,
because then the paragraph can be interpreted as s contains only a single operation, m(i).
On page 3: “We say that the element-fairness is better if the element-lateness and element-age is lower.”
Since the text goes on to say that if is lower, the other must be lower, the conjunction in the above sentence seems redundant, and it