Meta Protocol for Replication Strategies

Printer-friendly version

Replicating data is used everywhere: from caching values in your CPU to geo-replicating servers to handle a high number of requests over the entire world.

The main denominator of all these applications is availability and fault-tolerance. In simple words, if you put the same data on multiple locations, more people can access it faster at the same time and you have less chance of losing it.

Of course, if you replicate data on multiple locations, these replicas need to be kept consistent. There are two main different ways to do this. First, you never allow the replicated values to have a different value, i.e. if one of the values is changed on a site, all of the replicas are changed at the same time. This is called pessimistic replication. This puts a hard constraint on the replication strategy, namely the replicas can not be updated at the same time. In other words, conceptually all the replicas need to be locked when it is being changed on one site (computer or process). This provides an easy to understand programming model, but does not allow for a lot of concurrency. The second strategy is called optimistic replication. Here, the replicas are allowed to be updated concurrently. The algorithm is 'optimistic' in the sense that it assumes that not many conflicts will occur by allowing concurrent updates. If conflicts happen, they need to be resolved afterwards. By allowing concurrent updates, the replicated values will diverge and a consistency algorithm needs to make sure they eventually converge again. This is often also called eventual consistency. There are many different guarantees that the system can provide over the slightly diverging data: causal consistency, read-your-writes, monotonic reads, PRAM, etc.

The problem today is that every system implements its own set of guarantees over the data and allows for no or very little variation in it (e.g. Riak, DynamoDB, NoSQL, PostgreSQL, CRDTS, Cloud Types, Bayou, etc). It is up to the programmer to know the guarantees of each system and pick the best suited for its application at hand. The problem with this is of course that once a system is chosen and the requirements of the application change, it is hard to migrate to another system. Even more, probably some data of the application can function under some looser guarantees (e.g. youtube likes) while others (e.g. ticket reservation) require very strong guarantees.

Goal & Research Activities: 

In this thesis you will focus on higher-level replication, namely on the level of databases or application-level variables, because it represents the widest use of replication today. The ideas that you will work out will easily translate to other levels though. The goal is that you come up with a meta protocol for replicating data between multiple sites. By altering the meta protocol of the data you change the way it is replicated (e.g. from strongly consistent to causally consistent). The biggest challenge will be to come up with a meta protocol that is expressive enough to implement consistency strategies ranging from strongly consistent to weakly consistent. And, at the same time it still needs to be expressive and high-level enough such that the programmer does not have to implement every single bit of the algorithm herself. The idea is that an expert distributed programmer can implement a replication strategy using the meta protocol and that other, non-expert distributed, programmers can then use these protocols by declaring that their data uses it. This allows to change the replication strategy of certain data by simply changing its protocol.

Your work will comprise of the following things:

  1. Design a suitable meta protocol that allows to implement consistency strategies, ranging from strong to weakly consistent. This includes solving the following research questions:

    1. How do you express which sites participate in the replication of a particular piece of data? (what is the hierarchy)

    2. In what form are changes propagated to other sites? (operations, the entire state or only the delta-state)

    3. How are changes propagated to other sites? (flooding, master-slave, p2p, ...)

    4. If concurrent operations are allowed, how are conflicts detected and resolved?

  2. Make an implementation of the meta protocol you came up with in an actor language (e.g. Erlang, AmbientTalk or JavaScript with Actor.js). The notion of a site in replication maps very well to the notion of an actor.

  3. Implement a significant number of replication strategies using your meta protocol implementation.

  4. Run some benchmarks with different strategies to demonstrate the different guarantees, trade-offs and performance differences of the different strategies.

For this thesis you will become a ninja in the field of Replication Strategies, more specifically optimistic or eventually consistency strategies (which is a very hot topic right now with the peaking importance of distributed and cloud computing). Although, this is just an initial proposal and it will be entirely up to the student to choose its path. Any other completely different proposal concerning distribution, replication and/or cloud are also welcome with us.

--- Masters and