Replacing plyr’s ddply with data.table to improve an R script’s performance

For quite a while now, I have been using R as a scripting language to generate graphs for the benchmarking results of various experiments. Since we got a new 64-core machine at the lab, I have been running into performance issues with the scripts that process the benchmark results: the number of measurements increased significantly, and the plyr library I use was designed with ease of use in mind rather than performance. Concretely, I was seeing script runtimes of up to 25 min for a data set with roughly 50,000 measurements.

Fortunately, this is a known issue, and a quick search turned up various options to improve performance. Since my initial choice of plyr was based on its intuitive DSL, I was looking for something similarly convenient but a lot faster. This blog post briefly documents my findings for myself, and reminds me how to use the data.table package, which I ended up choosing.

Plyr’s ddply can be extremely slow

In an earlier post, I briefly outlined how I use R to process benchmarking results. So far, I have used ddply mostly with the transform and summarise functions to normalize measurements and to calculate averages, standard deviations, and speedup factors.

The problematic operations are, for instance, the following normalization and aggregation steps:

```r
process_data <- function(data, baseline) {
  norm <- ddply(data, ~ Benchmark + Cores,
                function(x) transform(x,
                  RuntimeRatio = Time / mean(Time[VirtualMachine == baseline]),
                  SpeedupRatio = mean(Time[VirtualMachine == baseline]) / Time))

  # Calculate basic aggregates
  stats <- ddply(norm, ~ Cores + Benchmark + VirtualMachine,
                 summarise,
                 Time.mean                 = mean(Time),
                 Time.stddev               =   sd(Time),

                 RuntimeRatio.mean         = mean(RuntimeRatio),
                 RuntimeRatio.geomean      = geometric.mean(RuntimeRatio),
                 SpeedupRatio.mean         = mean(SpeedupRatio),
                 SpeedupRatio.geomean      = geometric.mean(SpeedupRatio),

                 NumberOfMeasurements      = length(Time),

                 RuntimeRatio.stddev       = sd(RuntimeRatio),
                 RuntimeRatio.scaled_sd    = RuntimeRatio.stddev / RuntimeRatio.geomean,
                 RuntimeRatio.variance     = var(RuntimeRatio),
                 RuntimeRatio.median       = median(RuntimeRatio),
                 # RuntimeRatio.mean095Error = confInterval095Error(RuntimeRatio),
                 # RuntimeRatio.cnfIntHigh   = RuntimeRatio.mean + RuntimeRatio.mean095Error,
                 # RuntimeRatio.cnfIntLow    = RuntimeRatio.mean - RuntimeRatio.mean095Error,

                 SpeedupRatio.stddev       = sd(SpeedupRatio),
                 SpeedupRatio.scaled_sd    = SpeedupRatio.stddev / SpeedupRatio.geomean,
                 SpeedupRatio.variance     = var(SpeedupRatio),
                 SpeedupRatio.median       = median(SpeedupRatio),
                 # SpeedupRatio.mean095Error = confInterval095Error(SpeedupRatio),
                 # SpeedupRatio.cnfIntHigh   = SpeedupRatio.mean + SpeedupRatio.mean095Error,
                 # SpeedupRatio.cnfIntLow    = SpeedupRatio.mean - SpeedupRatio.mean095Error,

                 Time.median               = median(Time)
                 # Time.mean095Error         = confInterval095Error(Time),
                 # Time.cnfIntHigh           = Time.mean + Time.mean095Error,
                 # Time.cnfIntLow            = Time.mean - Time.mean095Error
  )

  list(norm = norm, stats = stats)
}
```

On the object-table data set from earlier experiments, the `process_data(...)` function takes roughly 2.1 sec to execute. Thus, it is not particularly fast, but still usable. However, on the recent data sets, the runtime became a severe bottleneck.

Thus, I looked into data.table as an alternative. The result is a rewrite of the function as `process_data_dt(...)` below, which yields a roughly 20x speedup. On the sample data mentioned, the runtime goes from 2.1 sec down to 0.2 sec. In my particular case, the runtime went from 25 min to 1.3 min. That’s still slow, but much more usable.

```r
process_data_dt <- function(data, baseline) {
  norm <- data.table(data)
  norm[, RuntimeRatio := Time / mean(Time[VirtualMachine == baseline]),
       by = list(Benchmark, Cores)]
  norm[, SpeedupRatio := mean(Time[VirtualMachine == baseline]) / Time,
       by = list(Benchmark, Cores)]

  # Calculate basic aggregates
  stats <- norm[, list(Time.mean                 = mean(Time),
                       Time.stddev               =   sd(Time),

                       RuntimeRatio.mean         = mean(RuntimeRatio),
                       RuntimeRatio.geomean      = geometric.mean(RuntimeRatio),
                       SpeedupRatio.mean         = mean(SpeedupRatio),
                       SpeedupRatio.geomean      = geometric.mean(SpeedupRatio),

                       NumberOfMeasurements      = length(Time),

                       RuntimeRatio.stddev       = sd(RuntimeRatio),
                       RuntimeRatio.variance     = var(RuntimeRatio),
                       # RuntimeRatio.mean095Error = confInterval095Error(RuntimeRatio),

                       RuntimeRatio.median       = median(RuntimeRatio),

                       SpeedupRatio.stddev       = sd(SpeedupRatio),
                       SpeedupRatio.variance     = var(SpeedupRatio),
                       SpeedupRatio.median       = median(SpeedupRatio),
                       # SpeedupRatio.mean095Error = confInterval095Error(SpeedupRatio),

                       Time.median               = median(Time)
                       # Time.mean095Error         = confInterval095Error(Time)
                       ),
                by = list(Cores, Benchmark, VirtualMachine)]

  # stats[, RuntimeRatio.cnfIntHigh := RuntimeRatio.mean + RuntimeRatio.mean095Error]
  # stats[, RuntimeRatio.cnfIntLow  := RuntimeRatio.mean - RuntimeRatio.mean095Error]
  stats[, RuntimeRatio.scaled_sd  := RuntimeRatio.stddev / RuntimeRatio.geomean]
  stats[, SpeedupRatio.scaled_sd  := SpeedupRatio.stddev / SpeedupRatio.geomean]
  # stats[, SpeedupRatio.cnfIntHigh := SpeedupRatio.mean + SpeedupRatio.mean095Error]
  # stats[, SpeedupRatio.cnfIntLow  := SpeedupRatio.mean - SpeedupRatio.mean095Error]
  # stats[, Time.cnfIntHigh         := Time.mean + Time.mean095Error]
  # stats[, Time.cnfIntLow          := Time.mean - Time.mean095Error]

  list(norm = norm, stats = stats)
}
```
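To make the pattern easy to try out in isolation, here is a minimal, self-contained sketch on toy data. The column names mirror the ones above, but the values are made up for illustration:

```r
library(data.table)

# Toy data set: two benchmarks, two VMs, a few measurements each
dt <- data.table(Benchmark      = rep(c("A", "B"), each = 4),
                 VirtualMachine = rep(c("base", "new"), times = 4),
                 Time           = c(10, 5, 12, 6, 20, 8, 22, 10))

# Grouped := adds a column in place, computed per Benchmark group:
# each measurement is related to the baseline VM's mean in its group
dt[, SpeedupRatio := mean(Time[VirtualMachine == "base"]) / Time,
   by = Benchmark]

# Grouped aggregation returns one row per (Benchmark, VirtualMachine)
stats <- dt[, list(Time.mean    = mean(Time),
                   Speedup.mean = mean(SpeedupRatio)),
            by = list(Benchmark, VirtualMachine)]
print(stats)
```

Note that `:=` modifies the data.table by reference instead of copying it, which is one of the main reasons it is so much faster than ddply with transform.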

To conclude, data.table is a nice R package that is much faster than plyr’s ddply, and, as the example shows, its interface is almost as pleasant as the one provided by ddply.

Supporting Concurrency Abstractions in High-level Language Virtual Machines

Last Friday, I defended my PhD dissertation. After a bit more than four years, I am finally done. I am very grateful to all the people who supported me along the way, and of course to my colleagues for their help.

My work focused on how to build VMs with support for all kinds of different concurrent programming abstractions. Since you don’t want to put them into a VM one by one, I was looking for a unifying substrate that is up to the task. Below, you’ll find the abstract as well as the slides.

In addition to the thesis text itself, the implementations and tools are available. Please see the project page for more details.

Abstract

During the past decade, software developers widely adopted JVM and CLI as multi-language virtual machines (VMs). At the same time, the multicore revolution burdened developers with increasing complexity. Language implementers devised a wide range of concurrent and parallel programming concepts to address this complexity but struggle to build these concepts on top of common multi-language VMs. Missing support in these VMs leads to tradeoffs between implementation simplicity, correctly implemented language semantics, and performance guarantees.

Departing from the traditional distinction between concurrency and parallelism, this dissertation finds that parallel programming concepts benefit from performance-related VM support, while concurrent programming concepts benefit from VM support that guarantees correct semantics in the presence of reflection, mutable state, and interaction with other languages and libraries.

Focusing on these concurrent programming concepts, this dissertation finds that a VM needs to provide mechanisms for managed state, managed execution, ownership, and controlled enforcement. Based on these requirements, this dissertation proposes an ownership-based metaobject protocol (OMOP) to build novel multi-language VMs with proper concurrent programming support.

This dissertation demonstrates the OMOP’s benefits by building concurrent programming concepts such as agents, software transactional memory, actors, active objects, and communicating sequential processes on top of the OMOP. The performance evaluation shows that OMOP-based implementations of concurrent programming concepts can reach performance on par with that of their conventionally implemented counterparts if the OMOP is supported by the VM.

To conclude, the OMOP proposed in this dissertation provides a unifying and minimal substrate to support concurrent programming on top of multi-language VMs. The OMOP enables language implementers to correctly implement language semantics, while simultaneously enabling VMs to provide efficient implementations.

• Supporting Concurrency Abstractions in High-level Language Virtual Machines, Stefan Marr. Software Languages Lab, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium, PhD Dissertation, January 2013. ISBN 978-90-5718-256-3.
• BibTeX: BibSonomy

Slides

Parallel Gesture Recognition with Soft Real-Time Guarantees

It has been a while since SPLASH’12, but I finally got around to putting up a copy of our paper from the AGERE’12 workshop. It is based on Thierry’s master’s thesis and presents his work on parallelizing a Rete engine for gesture recognition. Lode and I were his advisors and are happily working with him on what we promised in the future work section.

Abstract

Applying imperative programming techniques to process event streams, like those generated by multi-touch devices and 3D cameras, has significant engineering drawbacks. Declarative approaches solve these problems but have not been able to scale on multicore systems while providing guaranteed response times.

We propose PARTE, a parallel scalable complex event processing engine which allows a declarative definition of event patterns and provides soft real-time guarantees for their recognition. It extends the state-saving Rete algorithm and maps the event matching onto a graph of actor nodes. Using a tiered event matching model, PARTE provides upper bounds on the detection latency. Based on the domain-specific constraints, PARTE’s design relies on a combination of 1) lock-free data structures; 2) safe memory management techniques; and 3) message passing between Rete nodes. In our benchmarks, we measured scalability up to 8 cores, outperforming highly optimized sequential implementations.

• Parallel Gesture Recognition with Soft Real-Time Guarantees, Thierry Renaux, Lode Hoste, Stefan Marr, Wolfgang De Meuter, Proceedings of the 2nd edition on Programming Systems, Languages and Applications based on Actors, Agents, and Decentralized Control Abstractions (AGERE’2012), pages 35–46. ACM (2012).
• Paper: PDF
©ACM, 2012. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
• DOI: 10.1145/2414639.2414646
• BibTeX: BibSonomy

Sly and the RoarVM: Exploring the Manycore Future of Programming

My second talk at Smalltalks 2012 was most likely the reason the organizers invited me in the first place. It was a slightly extended version of the Sly and RoarVM talk from the FOSDEM Smalltalk Dev Room at the beginning of the year, reporting on the Renaissance Project.

This time, the talk included more details on the RoarVM and mentioned a few experiments.

Abstract

The manycore future holds several challenges that suggest fundamental assumptions of contemporary programming approaches no longer apply when scalability is required.

Sly is a language prototype designed to experiment with the inherently nondeterministic properties of parallel systems. It aims to enable programmers to embrace nondeterminism instead of guiding them to fight it. Nature shows that complex systems can be built from independent entities that achieve a common goal without global synchronization or communication. Sly is designed to enable the prototyping of algorithms that exhibit such emergent behavior. We introduce it in the first part of the talk.

The second part of the talk focuses on the underlying problems of building virtual machines for the manycore future that allow us to harness the available computing power. The RoarVM was designed for experiments on the Tilera TILE64 manycore processor architecture, which provides 64 cores and characteristics that are distinctly different from today’s commodity multicore processors. Memory bandwidth, caches, and communication are the biggest challenges on such architectures. This talk gives a brief overview of the design choices in the RoarVM that tackle the characteristics of the TILE64 architecture.

Acknowledgement: Sly and the RoarVM were designed and implemented by David Ungar and Sam Adams at IBM Research.

Slides

What If: Developing Applications in the Multicore Era

Yesterday was the first day of Smalltalks 2012 in Puerto Madryn. The organizers invited me to give a keynote on a topic of my choice, which I gladly did. Having just handed in my thesis draft, I chose to put my research into the context of Smalltalk and relate it to one of the main open questions: how do we actually want to program multicore systems?

The talk went ok, I think. Compared to academic conferences, I was surprised by the number of questions people asked. The discussions were also much more interesting for me than at a typical conference. Overall, a good experience.

Abstract

What if you needed to use all the processor cores you have just to get your application to run with acceptable performance? This talk explores how we can support the various abstractions for concurrent and parallel programming that would help us master the challenges of the multicore era. We show a variant of the RoarVM with a novel metaobject protocol that allows us to implement agents, actors, software transactional memory, and others easily while preserving performance.

Slides

Recording