How does Zeebe handle cluster failover? Can I gracefully shut down a cluster node?

calvin: i want to script restarts of k8s nodes running zeebe containers
is there any API i can hit to gracefully remove a zeebe node from the cluster while it reboots?
also is there an API i can watch to see that the remove was successful?

Josh Wulf: There is no API for this. Zeebe is magical and automatically handles this - that’s the design. We have been running multi-node clusters on pre-emptable nodes since last year.

If you find any scenarios where it is not magical, please open a GitHub issue.

calvin: thanks Josh! ok - i have a hard time with magic personally lol! any chance you could point me to the bit in your git repo where zeebe manages this? would be good just to have an overview of how it works

Alastair Firth: From the k8s side, Kubernetes sends SIGTERM to pods before evicting them, with a default of 30s before SIGKILL. The signal is passed on to the Zeebe process by tini https://github.com/zeebe-io/zeebe/blob/develop/Dockerfile#L26
From where Java handles the exit, I’m not sure. Typically you would run Zeebe in a StatefulSet, so as a k8s node reboots the pod is just evicted and rescheduled on a node which is still up - from the broker’s perspective it’s not removed from the topology, it’s just temporarily unavailable. Hope that helps
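The SIGTERM/SIGKILL contract Alastair describes can be sketched in a few lines of Python. This is not Zeebe's shutdown code - just an illustration of the window a process gets between Kubernetes delivering SIGTERM and the grace period expiring (the handler and flag names are invented for the example):

```python
import os
import signal

shutting_down = {"flag": False}

def on_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first; the process then has
    # terminationGracePeriodSeconds (30s by default) to clean up
    # before the kubelet follows up with an unblockable SIGKILL.
    shutting_down["flag"] = True

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate the kubelet delivering SIGTERM to this process.
os.kill(os.getpid(), signal.SIGTERM)

print(shutting_down["flag"])  # True - the handler ran; clean-up would go here
```

In the Zeebe image, tini forwards this signal to the JVM, which is why the container sees it at all.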

Josh Wulf: Even better, here is a video of it live by the person who wrote it:

Josh Wulf: https://www.youtube.com/watch?v=JbXKgQQmukE

calvin: awesome thanks Josh and Alastair!!

Josh Wulf: The whole design of Zeebe with the replicated event log and cluster management is the part where it is managed. Zeebe can run as a single broker, but what you are talking about is Zeebe as a system. There is no graceful exit for a single node in that scenario. It just goes away, and the others take up its work, because it is replicated across a cluster.

Josh Wulf: It’s either there and leading a partition, or it is not. And if it is not, another node has a replica and just takes over. Individual nodes are cattle

Josh Wulf: I don’t even think you could design a graceful exit. The only thing you could really do is signal the cluster that you are about to die, and they do an election without you, elect new leaders for any partitions you are leading, and start projecting any state in advance

Josh Wulf: But that is fraught with problems and complexities

Josh Wulf: What happens is - the broker goes away, it misses a heartbeat, the others notice, elect new leaders for any partitions that broker was leading, project any missing state from the event log, and carry on.
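The flow Josh describes can be shown as a toy simulation: a broker stops heartbeating, the others notice after a few missed rounds, and a surviving replica assumes leadership of its partitions. Every name and threshold below is invented for illustration - none of this is Zeebe internals:

```python
MISSED_HEARTBEATS_LIMIT = 3

brokers = {"broker-0", "broker-1", "broker-2"}
# partition id -> (current leader, ordered replicas)
partitions = {
    1: ("broker-0", ["broker-1", "broker-2"]),
    2: ("broker-1", ["broker-2", "broker-0"]),
}
missed = {b: 0 for b in brokers}

def record_heartbeats(alive):
    # Brokers that heartbeat reset their counter; silent ones accumulate.
    for b in brokers:
        missed[b] = 0 if b in alive else missed[b] + 1

def failed_brokers():
    return {b for b, n in missed.items() if n >= MISSED_HEARTBEATS_LIMIT}

def elect_new_leaders():
    for pid, (leader, replicas) in partitions.items():
        if leader in failed_brokers():
            # A healthy replica takes over; it would replay anything
            # missing from its last snapshot off the replicated log.
            new_leader = next(r for r in replicas if r not in failed_brokers())
            partitions[pid] = (new_leader, replicas)

# broker-0 goes away: it misses three heartbeat rounds in a row.
for _ in range(3):
    record_heartbeats(alive={"broker-1", "broker-2"})
elect_new_leaders()

print(partitions[1][0])  # broker-1 - it has taken over partition 1
```

The point of the sketch is the shape of the mechanism: failure is detected by absence, and recovery is a leadership change, not a data copy.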

Josh Wulf: You have a single cluster node removal mode - total failure - which deals with planned and unplanned outages.

Josh Wulf: A node is either all in, or not at all. You could reduce the latency of the kick-over time by eagerly projecting state on new leaders, but now you are supporting two removal modes - plus the edge cases (a broker says: “I’m leaving!” but doesn’t - now it’s a Zombie, or unreliable, and other nodes are reacting to a failure that hasn’t happened - will he, won’t he?)

Josh Wulf: Using resources, unclear about the broker topology. How does that broker recover? “Jks guys - I’m staying!”

Josh Wulf: Then it does it a few times, now the cluster is thrashing

Josh Wulf: Nah - either you are alive or dead. The cluster doesn’t care. We got this.

calvin: hmmm

calvin: great info by the way

calvin: i guess (thinking in terms of the only thing i know - elasticsearch) it would be nice to be able to avoid the thrashing by telling Zeebe not to bother re-electing leaders for partitions “because i’m just doing planned maintenance”

calvin: but no worries, i’m happy that it handles it at all :slightly_smiling_face:

Josh Wulf: When a node goes down, it was actively servicing partitions

Josh Wulf: all work running on those partitions stops

Josh Wulf: So your clients start throwing errors and processes running on those partitions halt

Josh Wulf: Timers stop

Josh Wulf: Attempts to publish messages to those running workflows have no effect

Josh Wulf: That node is actively managing a slice of the active state of the cluster

Josh Wulf: The others have that state, but as a replica of the actions that node is taking

Josh Wulf: the partition leader election has another node assume the role of active state management for that part

Josh Wulf: From an outside perspective there may be a momentary lag in processes that are running on a partition that has a new leader, and depending on the load and available resources remaining in the cluster, overall system latency may be affected

Josh Wulf: but if you have sufficient headroom, it is not affected - because you have n + 1 of what you need to service your workload

Josh Wulf: That is the redundancy aspect

Josh Wulf: I’m not sure how Elasticsearch clustering works - but it has a database, so shared mutable state

Josh Wulf: Any node has access to the same state at any time, or is it sharded in memory on nodes?

calvin: afaik the “master nodes” provide state which is accessible by all nodes. the main problem is that re-allocation of the partitions/indices in elasticsearch is a huge operation as you normally specify a minimum number of backup indices, so if one node goes down -> huge copy operations start

Josh Wulf: Zeebe is replicating all the time to allow failover, and the leader of a partition is doing in-memory computation - timers, state transitions

Josh Wulf: 15m-interval snapshot of in-memory state is the default, and every event is replicated to disk before it is actioned

Josh Wulf: So what is missing from the last replicated memory state snapshot is regenerated from the on-disk log by another node on fail-over
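The snapshot-plus-log recovery Josh describes can be sketched as follows: periodic snapshots of in-memory state, a durable event log that is written before each event is actioned, and on fail-over the new leader loads the last snapshot and replays only the events recorded after it. The event shape here is invented for illustration:

```python
event_log = []      # replicated to disk before each event is actioned
snapshot = ({}, 0)  # (copy of state, index of last event it includes)

def apply(state, event):
    key, value = event
    state[key] = value

def record(event):
    event_log.append(event)   # replicate to the log first...
    apply(live_state, event)  # ...then action it in memory

live_state = {}
record(("order-1", "created"))
record(("order-1", "paid"))

# Periodic snapshot of in-memory state (every 15 minutes by default).
snapshot = (dict(live_state), len(event_log))

record(("order-2", "created"))  # this event lands after the snapshot

# Fail-over: a replica rebuilds state from snapshot + remaining log.
recovered, replay_from = dict(snapshot[0]), snapshot[1]
for event in event_log[replay_from:]:
    apply(recovered, event)

print(recovered == live_state)  # True - nothing actioned was lost
```

Because every event hits the replicated log before it takes effect, the snapshot interval bounds replay time, not durability.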

Josh Wulf: “Just one second…. ah ok, you were after a latte - here it is”

archivist2: Gosh, this is an interesting conversation - I’m sending a copy to https://forum.zeebe.io for future reference!

archivist2: Tag me with what you want as the post title, and I’ll put this thread in the Forum for you.

Josh Wulf: <@UTM6C2C3H> Zeebe fail-over
