Split brain possibility


#1

Hi,

First of all: very interesting project and something coming exactly at the right time for our Content Services (microservices), so kudos.

Question: Is there a possibility of a split brain situation? That is, the brokers disconnect from each other and the microservices communicate with disconnected nodes, possibly causing data duplication or, worse, corruption, because the microservices are unaware of the situation.

Kind regards,

Niels.


#2

Hi Niels,

thank you for your question.

tl;dr: No.

Zeebe uses the Raft consensus algorithm for replication. With Raft, every replication group has a single leader. The leader is the only node that adds events to the log, either by accepting client requests or by processing events already on the log. An event on the log only becomes visible to the internal state machine and to clients once it is committed, meaning it has been replicated to at least a quorum of nodes. The quorum is the group size divided by 2 (rounded down) plus 1, e.g. for a replication group of 5 nodes the quorum is 3. During a network split there is no constellation in which two different nodes can each reach a quorum of nodes to commit events on the log. It can happen that two leaders exist during a split, but only one of them will have access to a quorum to commit events. Once the split is over, one of the leaders will notice that it is outdated, step down and resync its log with the current leader.
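To make the quorum argument concrete, here is a minimal Python sketch. The function names are made up for illustration and are not part of Zeebe. It computes the quorum as described above and checks every two-way split of a group to see whether both sides could reach a quorum at the same time.

```python
def quorum(group_size: int) -> int:
    """Smallest majority of a replication group: floor(n / 2) + 1."""
    return group_size // 2 + 1


def two_quorums_possible(group_size: int) -> bool:
    """Try every two-way network split and report whether both sides
    could hold a quorum at once (which would allow a split brain)."""
    q = quorum(group_size)
    for side_a in range(group_size + 1):
        side_b = group_size - side_a
        if side_a >= q and side_b >= q:
            return True
    return False


for n in (3, 4, 5, 7):
    print(n, quorum(n), two_quorums_possible(n))
```

For every group size the last column is `False`: two disjoint sides would need at least 2 * (n // 2 + 1) nodes, which is always more than n.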

I hope this answers your question.

Cheers,
Sebastian


#3

Hi Sebastian,

I follow how the broker cluster can split and then reform and establish consistency again. However, for external systems/services and state, i.e. the ‘business resources’ a task may act on, I am assuming that task operations on a business resource must be idempotent if the entire distributed system is to support this ‘partition tolerance’.

Let me elaborate. We start with a cluster of brokers called cluster A. I have an order management process with two tasks: Pick & Pack Goods, followed by Debit Customer Account. The process starts and the Pick & Pack task completes. At this point the broker cluster splits into two, cluster A.1 and cluster A.2. Each cluster may now push the Debit Customer Account task to a distinct task worker. Thus, if this service is not idempotent, the customer’s account may get debited twice. Afterwards, clusters A.1 and A.2 re-establish connectivity and thus consensus, but the customer account may already have been debited twice…

Hence is this a valid behaviour and by implication tasks should as far as possible be idempotent?

regards

Rob


#4

Hi Rob,

thanks for your question.

This is true, and it is necessary even when there is no cluster partition. The task subscription only guarantees that a task is seen at least once. There are different situations which can lead to a task being processed twice by workers. For example, if a worker exceeds the lock time of the task, the task can be handed out to a second worker. The same happens if the client cannot reach the broker to mark the task as completed, so that the lock time expires. Even though only one client will be able to complete the task, it is possible that more than one client receives it.

I think the scenario you describe would not lead to a task being pushed to two workers. Before a task is pushed to a worker it is locked, and this lock event is written to the log. Only when the lock event is committed will the task be pushed to the worker. A leader without a quorum cannot commit this lock event and therefore will not push the task to the worker.
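The rule "push only after the lock event is committed" can be sketched as a toy model in Python. The class and method names are hypothetical and do not reflect the broker's real code. The leader counts itself plus the followers it can currently reach and only pushes the task if that reaches the quorum.

```python
class PartitionLeader:
    """Toy model of a leader for one partition (hypothetical names)."""

    def __init__(self, group_size: int, reachable_followers: int):
        self.quorum = group_size // 2 + 1
        self.reachable_followers = reachable_followers
        self.log = []      # events written locally by this leader
        self.pushed = []   # tasks actually handed to a worker

    def lock_and_push(self, task: str) -> bool:
        # The lock event is always written to the local log first.
        self.log.append({"event": "LOCK", "task": task, "committed": False})
        acks = 1 + self.reachable_followers  # the leader counts itself
        if acks < self.quorum:
            return False  # not committed, so the task is never pushed
        self.log[-1]["committed"] = True
        self.pushed.append(task)
        return True


# A group of 5 split into a side with 3 nodes and a side with 2 nodes.
majority_side = PartitionLeader(group_size=5, reachable_followers=2)
minority_side = PartitionLeader(group_size=5, reachable_followers=1)
print(majority_side.lock_and_push("debit-customer-account"))  # True
print(minority_side.lock_and_push("debit-customer-account"))  # False
```

Only the leader on the majority side hands the task to a worker; the other leader has the lock event in its log but never commits it.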

In general, all state changes have to be committed to the log before a client will see them. This also means that if a cluster loses its quorum, it is no longer available, in the sense that it will not accept any further state changes until a quorum is available again. This also reinforces your point that a task worker should be idempotent, as it may not be able to complete a task on an unavailable cluster.
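As an illustration of what an idempotent worker could look like, here is a small Python sketch. The `AccountService` class and the idea of deduplicating on a task key are assumptions made for this example, not something the Zeebe client provides. The service remembers which task keys it has already applied, so a task that is delivered a second time has no further effect.

```python
class AccountService:
    """Hypothetical business service that debits an account at most
    once per task key, even if the task is delivered several times."""

    def __init__(self, balance: int):
        self.balance = balance
        self.applied_task_keys = set()

    def debit(self, task_key: str, amount: int) -> bool:
        if task_key in self.applied_task_keys:
            return False  # duplicate delivery: nothing to do
        self.balance -= amount
        self.applied_task_keys.add(task_key)
        return True


service = AccountService(balance=100)
service.debit("task-42", 30)  # first delivery debits the account
service.debit("task-42", 30)  # redelivery after a lock timeout is ignored
print(service.balance)        # 70
```

In a real system the set of applied keys would have to be stored durably together with the balance change, otherwise a crash between the two steps could still cause a double debit.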

Cheers,
Sebastian


#5

Thanks Sebastian,

I read up on the Raft consensus algorithm, so I think I’ve got it now. A Raft cluster cannot operate in ‘split brain’ mode: there is either majority consensus, or the cluster is unavailable.

To rephrase using the CAP theorem, I would suggest that a Zeebe cluster is Consistent and highly Available at the expense of Partition tolerance. Hence it aligns with the two out of three according to CAP.

However, idempotency of tasks is still desirable for the reasons you outline above.

regards

Rob


#6

Hi Rob,

I would argue that Raft guarantees consistency and partition tolerance, but not availability, as a Raft cluster will not respond to client requests once it has lost so many nodes that no quorum remains.

Other than that I agree with your summary.

Cheers,
Sebastian


#7

Hi Sebastian,

Strictly speaking, it is probably more correct to say that Raft is more CP than CA. I guess it depends on the definition of partition tolerant: does it mean the system will operate with concurrent separate partitions, or that it is tolerant of partitioning into quorum versus non-quorum…

Either way, our common understanding of the resulting behaviour is consistent :slight_smile:

regards

Rob


#8

FYI on the definition question: Some people argue that you always have P when you can’t make assumptions about the network, see http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/, i.e. there is no CA with a network that can lose messages.


#9

I asked myself one question: If a cluster does not have a predefined number of nodes, but can start new nodes, e.g. by auto-scaling, how do you know the total number of nodes to calculate the required quorum?

Or do we currently handle this by a configured group and quorum size that cannot be adjusted dynamically during runtime?


#10

The number of nodes in a cluster is not related to the quorum of a replication group. A replication group is a group of nodes in the cluster which replicate a single partition. So in a Zeebe cluster there are P replication groups for P partitions. In the current version of Zeebe the size of the replication group is not configurable and will most likely grow if the number of cluster nodes grows, which is not optimal and only a temporary solution.

In the near future we will change that in a way that the user can define the size of the replication group for a topic when it is created. This group size is then fixed and will not change. Maybe we will allow an API call to change it manually but it will not change automatically.


#11

Thanks for the elaboration!

Just to make sure if I understand correctly:

  • Currently you could still have split brain if the number of nodes grows in one replication group.
  • Later on split brain will be avoidable - but for the price of having a fixed group size which will not auto-scale.
  • A management API could make the scaling dynamic and probably be called by an auto-scaling environment, but this will be a controlled environment where it is made sure somehow to not get a split brain in the moment of upscaling.

Correct?

Thanks
Bernd


#12

Hi Bernd,

we have to make sure that we do not mix up replication with scaling. Replication is used to gain fault-tolerance. It does not give us scalability. Scalability in Zeebe is achieved by partitioning.

So to answer your questions:

Currently you could still have split brain if the number of nodes grows in one replication group.

No. As we use Raft as the consensus algorithm, there is never a chance of a split brain, no matter which cluster size or replication group size we reach. The reason is that Raft requires a quorum (group size / 2 + 1) to commit log entries, and there exists no group size which can be partitioned in a way that yields multiple groups each the size of a quorum.

Later on split brain will be avoidable - but for the price of having a fixed group size which will not auto-scale.

There is no real reason why you would want your replication group size (replication factor) to scale. It is basically a trade-off between fault tolerance and replication speed. So you normally choose either 3, 5 or 7 as the replication factor, which lets you tolerate 1, 2 or 3 failed nodes. Scalability is achieved by partitioning a topic, which means if you think about auto-scaling you want to increase/decrease partitions. Increasing partitions is quite simple, but decreasing requires a procedure to deal with the data on the partitions to be removed. So it may happen that we implement adding partitions in the near future, but removing them is kind of tricky.
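The numbers above follow directly from the quorum formula. A short Python sketch (the function name is only illustrative) shows how many members can fail before a group loses its quorum:

```python
def tolerated_failures(replication_factor: int) -> int:
    """Members that may fail while a quorum is still reachable."""
    quorum = replication_factor // 2 + 1
    return replication_factor - quorum


for rf in (3, 4, 5, 6, 7):
    print(rf, tolerated_failures(rf))
```

This prints 1, 1, 2, 2 and 3 failures for the factors 3 to 7, which is why odd replication factors are the usual choice: going from 3 to 4 members adds replication work without tolerating an additional failure.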

Cheers,
Sebastian


#13

Thanks for the clarification! I think I still miss one piece in my understanding:

  • Is Raft Consensus used within one partition or throughout the whole Zeebe cluster? I guess the latter?
  • That means that the whole Zeebe cluster only works if > 50% of the nodes are available, correct?
  • So I still don’t understand how the Zeebe cluster knows how many nodes there are. So how is it determined how many nodes are needed for a quorum? Especially as the cluster can grow/auto-scale? What I read from the Raft paper (https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf) is: “In order to ensure safety, configuration changes must use a two-phase approach.” So something like e.g. “joint consensus” must be implemented in https://github.com/zeebe-io/zb-raft - right? Is this already done?

Sorry for the annoyance! I understand raft at the basic level of http://thesecretlivesofdata.com/raft/ - but I lack some bits and pieces for the full picture. As indicated I am also always happy to take discussions onto Skype or the like if you prefer.

Cheers
Bernd


#14

Hey @berndruecker,

Is Raft Consensus used within one partition or throughout the whole Zeebe cluster? I guess the latter?

The Raft consensus algorithm is used for a single partition. Every partition is its own Raft group, and there is no interaction between different Raft groups, even if they are part of the same topic. A Raft group always starts as a single node, the node which creates the partition itself, and then grows to a specific size by inviting other Zeebe cluster nodes. This size will later be known as the replication factor.

The creation of a topic will look something like this:

  • user requests to create topic foo with 4 partitions and a replication factor of 3
  • the broker who is at the moment leader for the internal-system topic partition 0 will then orchestrate the creation of the 4 partitions by choosing 4 brokers to locally create a partition
    • note: this does not require more than a single broker in the cluster as a broker can host multiple partitions
  • for every partition the chosen broker will create the partition locally and then start a new raft group for this partition
  • the raft group leader broker then tries to find enough other brokers to fulfill the replication factor requirement of 3, which means it needs 2 followers so the raft group size is 3
    • note: this actually requires a cluster size of at least the replication factor as a single broker cannot be in the same raft group multiple times

So in summary the size of a Zeebe cluster is unrelated to a raft group size, except that it has to be at least the highest replication factor.
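The steps above can be sketched roughly in Python. This is only an illustration: the function `plan_topic`, the round-robin placement and the broker names are assumptions of this example, and the real orchestration may choose brokers differently.

```python
def plan_topic(brokers, partitions, replication_factor):
    """Assign a leader and followers to every partition of a new topic.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication_factor > len(brokers):
        # A broker cannot be in the same raft group more than once.
        raise ValueError("cluster is smaller than the replication factor")
    plan = {}
    for partition in range(partitions):
        members = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
        # The first broker creates the partition locally and starts the
        # raft group; the remaining members are invited as followers.
        plan[partition] = {"leader": members[0], "followers": members[1:]}
    return plan


plan = plan_topic(["broker-0", "broker-1", "broker-2"],
                  partitions=4, replication_factor=3)
for partition, group in plan.items():
    print(partition, group["leader"], group["followers"])
```

With 3 brokers, 4 partitions and a replication factor of 3, every broker ends up hosting all four partitions, which shows that the cluster size only has to be at least the replication factor and is otherwise independent of the number of partitions.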

That means that the whole Zeebe cluster only works if > 50% of the nodes are available, correct?

The Zeebe cluster is basically always available. But access to a specific topic partition is only available if a quorum of its raft members is available. Assume you have five nodes A, B, C, D, E, and a raft group for partition 1 with the members A, B and C. If D and E become unavailable, partition 1 is still accessible. If a single node of A, B or C is unavailable as well, partition 1 is still fine. But if at least two nodes of A, B or C go down, partition 1 will not be available anymore.
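The same example can be written as a few lines of Python. The function name is illustrative; the node names are the ones from the example above.

```python
def partition_available(raft_members, unavailable_nodes):
    """A partition is available while a quorum of its own raft members
    is up; nodes outside the raft group do not matter."""
    alive = [m for m in raft_members if m not in unavailable_nodes]
    return len(alive) >= len(raft_members) // 2 + 1


partition_1 = {"A", "B", "C"}
print(partition_available(partition_1, {"D", "E"}))       # True
print(partition_available(partition_1, {"D", "E", "A"}))  # True
print(partition_available(partition_1, {"A", "B"}))       # False
```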

So I still don’t understand how the Zeebe cluster knows how many nodes there are

The information of how many nodes exist in a Zeebe cluster is distributed over the Gossip protocol. This is not done by Raft. The cluster orchestration uses the information shared over Gossip to find members for a Raft group.

So how is it determined how many nodes are needed for a quorum?

The quorum is always determined in a single raft group (partition) based on the current number of members of this raft group. This is not related to the cluster size.

“In order to ensure safety, configuration changes must use a two-phase approach.”

These configuration operations add and remove members of a raft group, not of the Zeebe cluster. They are implemented in a way that only a single node can join or leave a raft group at a time. This again is not related to brokers joining the Zeebe cluster, and it is also not related to scaling Zeebe. The only reasons to add nodes to a raft group are to reach the replication factor, or to replace a node by having one leave and another join. These are operations which will only happen at the start of a raft group or during maintenance. They are not necessary to scale Zeebe. To scale Zeebe you would add partitions to a topic.

The interesting part about auto-scaling is how to add, spread and remove partitions of a topic in a Zeebe cluster. We don’t have an answer to these questions yet. But none of that will change the size of a Raft group in the end, except that moving a partition from one node to another would require exchanging a node in the Raft group.

Hope that clarifies again a bit what’s going on.

Cheers,
Sebastian


#15

Hi Sebastian.

Yes, that does clear a lot of things for me at the moment (I think). One very last question - and a simple “yes” is sufficient as answer if my assumption is correct:

  • So in case of a network partition with segments (A and B, both having nodes > replication factor) it could be that Zeebe partition 1 still has a leader with quorum in segment A, and partition 2 has a leader with quorum in segment B. Depending on which segment a client can reach, it will either work without problems or reject any requests.

Not really a real-life question, just a double-check that I have understood :slight_smile:

Thanks a lot for taking the time to walk me through! I really appreciate it big time :+1:

Cheers
Bernd


#16

Hi Bernd,

tl;dr: yes

This is the general idea: a partition stays available as long as a quorum of its raft members is reachable. There are details in the current implementation of the client which might still lead to errors. For example, subscriptions might currently break if not all partitions of a topic are reachable by the client. But these are issues we want to tackle in the future to make the client more resilient against cluster failure scenarios.

Dependent on which segment a client can reach it will either work without problems or reject any requests

One side note on this: the client will send requests to the leader of a partition if it is reachable, even if there is no quorum available for this partition at the moment, because the client cannot know this. The leader will accept the event and write it to the log. What will not happen is that the event gets committed (== replicated to a quorum of raft members). The client will then run into a timeout error as no response is received, because the response is only sent once the event has been committed. What can still happen is that the event is committed later, when the quorum is reachable again. But it could also happen that the leader which wrote the event to its log has to step down after the network partition is resolved. In this case the event (which was not committed) might be removed from its log. But these are details of Raft. I just wanted to point out that the client is not always aware of why it didn’t get a response, and that a missing response does not mean the event will never be committed.
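Both outcomes for an uncommitted event can be shown with a toy model in Python. `LeaderLog` and its methods are hypothetical names for this sketch and strongly simplify what Raft really does.

```python
class LeaderLog:
    """Toy model of a partition leader's log (hypothetical names)."""

    def __init__(self, group_size: int):
        self.quorum = group_size // 2 + 1
        self.entries = []  # each entry: {"event": str, "committed": bool}

    def append(self, event: str, reachable_members: int):
        """Write the event locally; respond only if it gets committed.

        reachable_members includes the leader itself. Returning None
        models the client timing out because no response is sent.
        """
        committed = reachable_members >= self.quorum
        self.entries.append({"event": event, "committed": committed})
        return "ok" if committed else None

    def quorum_restored(self):
        # Outcome 1: pending events are replicated and committed late.
        for entry in self.entries:
            entry["committed"] = True

    def step_down(self):
        # Outcome 2: the outdated leader drops its uncommitted events.
        self.entries = [e for e in self.entries if e["committed"]]


log = LeaderLog(group_size=3)
print(log.append("create-task", reachable_members=1))  # None (timeout)
log.quorum_restored()
print(log.entries[0]["committed"])  # True, although the client timed out
```

Because the client cannot tell the two outcomes apart when it sees the timeout, a retry may create a duplicate, which is one more reason to keep the operations idempotent.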

Cheers,
Sebastian


#17

Yes - that makes perfect sense and is in line with how I understood Raft. Thanks!