Skitter - The Extendable Distributed Stream Processing Language

Skitter is a Domain Specific Language for building scalable distributed stream processing applications with pluggable distribution strategies.

Modern Distributed Stream Processing Engines offer developers limited control over how the various operations (map, reduce, join, …) in their applications are distributed over a cluster. We believe the performance of a distributed stream application can be improved by offering developers full control over the distribution logic of their applications. Skitter is a distributed stream processing language which enables the creation of custom distribution strategies, allowing developers to determine how operations are distributed over a cluster.

Skitter is developed as a DSL in Elixir, an extensible language built on top of the Erlang VM. It is available on GitHub and was developed in the context of my PhD.

Customisable

Skitter enables the creation of custom distribution strategies. These strategies determine how a data processing application is distributed over multiple machines, allowing the creation of distribution logic tailored towards the properties of the application.

Composable

Skitter applications are built by combining reusable data processing operations into a workflow. Each of these operations may be distributed by a different distribution strategy, enabling multiple distribution strategies to be used in a single application.

Extendable

Skitter provides a trait system for extending existing distribution strategies. New strategies can be built on top of previously defined ones, and minor modifications to existing strategies become straightforward.

Skitter by Example

Skitter applications are written in terms of three abstractions: operations, workflows and strategies. Operations define the data processing logic of an application. They are composed into workflows, which define data processing pipelines. Each operation is paired with a distribution strategy, which determines how the operation is distributed over the available machines at runtime. Below, we provide a bird's eye overview of these concepts based on a few simple examples.

Operations

The data processing logic of Skitter applications is expressed in operations. These operations are built in Skitter's operation definition language. While writing an operation, a developer does not have to reason about distribution; they only have to specify how the operation reacts to incoming data.

defoperation FahrenheitToCelcius, in: fahrenheit, out: celcius do
  defcb react(fahrenheit) do
    ((fahrenheit - 32) * (5 / 9)) ~> celcius
  end
end

This operation defines a single callback, react, which converts its argument fahrenheit to celcius. The resulting value is emitted to the celcius port. At runtime, the strategy of this operation calls react, after which the emitted value is sent to the operations downstream of FahrenheitToCelcius.
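The arithmetic inside react can be tried out without Skitter. The following is a minimal sketch in plain Elixir, not Skitter's DSL: the module name ConversionSketch and the function convert/1 are hypothetical, and the return value stands in for the value the operation would emit.

```elixir
defmodule ConversionSketch do
  # Plain-Elixir stand-in for the body of the react callback.
  # The returned value plays the role of the emitted value.
  def convert(fahrenheit) do
    (fahrenheit - 32) * (5 / 9)
  end
end

# 32 °F is the freezing point of water, 212 °F its boiling point.
IO.inspect(ConversionSketch.convert(32))
IO.inspect(ConversionSketch.convert(212))
```

Note that `/` always performs floating point division in Elixir, so the result is a float even for integer input.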

defoperation Count, in: value, out: current, strategy: KeyedState do
  initial_state 0

  defcb key(value), do: value

  defcb react(value) do
    state() <~ state() + 1
    {value, state()} ~> current
  end
end

Operations can maintain state. This operation counts the number of times it has seen each value. The distribution strategy of this operation separates the incoming data elements based on some key and maintains a state for each key. When a new data element arrives, the strategy invokes the key callback to determine where to process the data element. Afterwards, the react callback is invoked to process the data element. This callback updates its internal state, after which it emits the current count to its successors.
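The per-key state handling described above can be illustrated on a single machine. This is a rough sketch in plain Elixir rather than Skitter code: the module CountSketch and its run/1 helper are hypothetical, and react is modelled as a pure function that returns the emitted value together with the new state instead of using ~> and <~.

```elixir
defmodule CountSketch do
  # Mirrors `initial_state 0` in the operation.
  def initial_state, do: 0

  # Mirrors the key callback: the value itself is the key.
  def key(value), do: value

  # Mirrors react: takes the state of one key and returns
  # {emitted_value, new_state}.
  def react(state, value) do
    new_state = state + 1
    {{value, new_state}, new_state}
  end

  # Feeds a list of values through react, keeping one state per key.
  # Returns {emitted_values_in_order, final_state_map}.
  def run(values) do
    Enum.map_reduce(values, %{}, fn value, states ->
      k = key(value)
      {out, new_state} = react(Map.get(states, k, initial_state()), value)
      {out, Map.put(states, k, new_state)}
    end)
  end
end

{emitted, states} = CountSketch.run(["Hello", "Skitter", "Hello", "world"])
IO.inspect(emitted)
IO.inspect(states)
```

In Skitter, the bookkeeping done by run/1 is the responsibility of the strategy, which is why the operation itself only contains key and react.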

Workflows

Operations can be composed into workflows. These workflows are built in a simple textual language where operations are linked to other operations or to other, nested, workflows.

The workflow below defines a data processing pipeline which counts words. It combines Skitter's built-in operators (stream_source, flat_map and print) with the Count operation defined above.

The with: statement can be used to override the strategy that is used to distribute an operation.

workflow do
  stream_source(["Hello Skitter", "Hello world"])
  ~> flat_map(&String.split/1, with: MyCustomStrategy)
  ~> node(Count)
  ~> print()
end
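For comparison, the same word count can be written as a single-machine pipeline using Elixir's standard Stream module. This sketch is only an analogy and involves no distribution or strategies; the module name WordCountSketch is hypothetical.

```elixir
defmodule WordCountSketch do
  # Single-machine analogue of the workflow: split lines into words,
  # then emit a running {word, count} pair for every word seen.
  def run(lines) do
    lines
    |> Stream.flat_map(&String.split/1)
    |> Stream.transform(%{}, fn word, seen ->
      count = Map.get(seen, word, 0) + 1
      {[{word, count}], Map.put(seen, word, count)}
    end)
    |> Enum.to_list()
  end
end

IO.inspect(WordCountSketch.run(["Hello Skitter", "Hello world"]))
```

The difference with the Skitter workflow is that each step there is an operation whose placement on the cluster is decided by its strategy, whereas this pipeline runs in a single process.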

Strategies

A unique feature introduced by Skitter is the notion of a distribution strategy. Every Skitter operation must be paired with a strategy, either in the operation definition or when the operation is used in a workflow. Like operations and workflows, strategies are created through the use of a DSL. A strategy is defined by implementing several hooks defined by the Skitter runtime system.


defstrategy KeyedState do
  defhook deploy(args) do
    Remote.on_all_workers(fn -> local_worker(Map.new(), :aggregator) end)
    |> Enum.map(fn {_remote, worker} -> worker end)
  end

  defhook deliver(data) do
    key = call(:key, args: [data]).result
    aggregators = deployment()
    idx = rem(Murmur.hash_x86_32(key), length(aggregators))
    worker = Enum.at(aggregators, idx)
    send(worker, data)
  end

  defhook process(data, state_map, :aggregator) do
    key = call(:key, args: [data]).result
    state = Map.get(state_map, key, initial_state())
    res = call(:react, state: state, args: [data])
    emit(res.emit)
    Map.put(state_map, key, res.state)
  end
end

This strategy partitions the state of an operation over several workers based on some key. It is used to distribute the Count operation shown above.

The strategy creates a worker for every machine in the cluster inside the deploy hook, which is called by the runtime system when an operation needs to be deployed over the cluster. Any data returned by this hook is stored inside the deployment, which is automatically available in all other hooks through the use of the deployment() primitive.

When a data element needs to be processed, the deliver hook is called to send it to a worker. To do so, the strategy calls the key callback of the operation it is paired with (using call, in the first line of the hook) to obtain the key associated with the incoming data. It then hashes the key to select one of the workers stored in the deployment and sends the data element to that worker.
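The routing step can be sketched in isolation. The strategy above hashes the key with Murmur.hash_x86_32; to stay self-contained, the sketch below substitutes :erlang.phash2/2 from the Erlang standard library, which hashes a term into the range 0 to N-1. The module RoutingSketch and the worker names are hypothetical.

```elixir
defmodule RoutingSketch do
  # Pick a worker for a key: hash the key into an index in the
  # range 0..length(workers) - 1 and look up the worker at that index.
  def pick_worker(key, workers) do
    idx = :erlang.phash2(key, length(workers))
    Enum.at(workers, idx)
  end
end

workers = [:worker_a, :worker_b, :worker_c]

for word <- ["Hello", "Skitter", "Hello", "world"] do
  IO.inspect({word, RoutingSketch.pick_worker(word, workers)})
end
```

Because the hash only depends on the key, every occurrence of the same key is routed to the same worker, which is what allows each worker to hold the state of its keys locally.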

When a worker receives a data element, the process hook is invoked. Here, the strategy obtains the key of the incoming data record and fetches the state associated with that key (the first two lines of the hook), after which it calls the react callback to process the incoming data with this state. The result of this callback is emitted into the workflow, while the updated state is stored to be used by subsequent invocations of the process hook.
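The behaviour of a single worker can be approximated with an ordinary Elixir process that owns a state map. This is a sketch under stated assumptions, not how Skitter implements its workers: the module WorkerSketch, the {:data, value} and {:emit, result} message shapes, and the sink process are all hypothetical, and the counting logic is inlined instead of going through the key and react callbacks.

```elixir
defmodule WorkerSketch do
  # Start a worker process that owns a state map (key => count)
  # and sends every emitted value to the `sink` process.
  def start(sink), do: spawn(fn -> loop(%{}, sink) end)

  defp loop(state_map, sink) do
    receive do
      {:data, value} ->
        # Fetch the state for this key, falling back to the initial state.
        state = Map.get(state_map, value, 0)
        new_state = state + 1
        # "Emit" the result by sending it to the sink.
        send(sink, {:emit, {value, new_state}})
        # Keep the updated state for the next message.
        loop(Map.put(state_map, value, new_state), sink)

      :stop ->
        :ok
    end
  end
end

worker = WorkerSketch.start(self())
for word <- ["Hello", "Skitter", "Hello"], do: send(worker, {:data, word})

for _ <- 1..3 do
  receive do
    {:emit, result} -> IO.inspect(result)
  after
    1_000 -> IO.puts("no result received")
  end
end

send(worker, :stop)
```

Returning the updated map from the process hook corresponds to the recursive call to loop/2 here: in both cases the new state replaces the old one before the next data element is handled.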

Getting Started

We have developed Skitter as a DSL in Elixir; it is available on GitHub. Detailed instructions on getting a Skitter project up and running can be found in the documentation. Instructions for getting started with the older versions of Skitter described in specific papers can be found by following the documentation link beneath each paper in the "Publications" section below.

Publications

Skitter: A Distributed Stream Processing Framework with Pluggable Distribution Strategies

The Art, Science, and Engineering of Programming

This paper discusses Skitter and its support for expressing distribution strategies in a modular fashion. It introduces the notion of distribution strategies and the reasoning for the design of a stream processing framework which supports them as a first-class concept. The paper discusses the design of Skitter and its distribution strategies and compares their modularity and performance with distribution strategies expressed in Storm.
Paper, Slides, BibTeX, Artifact, Documentation for this version of Skitter

Skitter: A DSL for Distributed Reactive Workflows

International Workshop on Reactive and Event-Based Languages and Systems (REBLS), November 2018

This paper discusses the initial version of Skitter, as presented at REBLS 2018 and the SPLASH 2018 Poster Session. This version of Skitter does not allow developers to specify custom distribution strategies. Instead, an effect system was used to declaratively specify the properties of an operation (called "components" at the time). This information was used by the runtime to select an appropriate distribution strategy. The need to support additional effects motivated the creation of the extendable distribution strategies present in Skitter today.
Paper, Poster, Slides, BibTeX, Documentation for this version of Skitter
