(I am sure I am doing something wrong, but I have not yet found out what.)
I am using the default cluster setup from the zeebe-docker-compose project, with Zeebe 0.22.1, 3 nodes with 2 partitions each.
I am running a modified ChaosToolkit chaos test on my local machine (MacBook Pro, macOS 10.15.3 (19D76), 32 GB RAM) that randomly kills a broker. I get lots of timeout errors, and now the cluster refuses to respond to deployments:
(The naming is a bit confusing: the docker-compose config numbers the brokers starting from 1, but Zeebe seems to refer to the nodes with 0-based IDs.)
[Broker-1-zb-actors-1] WARN io.zeebe.broker.workflow.repository - Failed to receive deployment response for partitions [2] (topic 'deployment-response-2251799813693058'). Retrying
(repeated ad infinitum)
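To keep the two numbering schemes apart while reading the logs, here is a minimal sketch of the mapping as I understand it. It assumes the off-by-one relationship seen in the logs in this post (container zeebe_broker_1 logs as Broker-0, zeebe_broker_3 as Broker-2); the function names are my own and are not part of Zeebe or docker-compose.

```python
import re


def node_id_from_service(service_name: str) -> int:
    """Map a 1-based compose service name such as 'zeebe_broker_2'
    to the 0-based node id that shows up in the Zeebe logs.

    Assumption: the offset is always exactly one, as observed in the logs.
    """
    match = re.fullmatch(r"zeebe_broker_(\d+)", service_name)
    if match is None:
        raise ValueError(f"unexpected service name: {service_name!r}")
    return int(match.group(1)) - 1


def thread_prefix(service_name: str) -> str:
    """Return the 'Broker-N' prefix expected in that container's thread names."""
    return f"Broker-{node_id_from_service(service_name)}"
```

With this, thread_prefix("zeebe_broker_2") gives the prefix to search for in that container's own log lines.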
Broker 1 seems fine:
zeebe_broker_1 | 2020-02-28 09:44:56.692 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - Removing follower partition service for partition PartitionId{id=1, group=raft-partition}
zeebe_broker_1 | 2020-02-28 09:44:56.696 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - Partition role transitioning from null to LEADER
zeebe_broker_1 | 2020-02-28 09:44:56.697 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - Installing leader partition service for partition PartitionId{id=1, group=raft-partition}
Broker 3 also seems fine:
zeebe_broker_3 | 2020-02-28 09:45:06.666 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-2 partitions succeeded. Started 2 steps in 3546 ms.
zeebe_broker_3 | 2020-02-28 09:45:06.668 [Broker-2-HealthCheckService] [Broker-2-zb-actors-0] DEBUG io.zeebe.broker.system - All partitions are installed. Broker is ready!
zeebe_broker_3 | 2020-02-28 09:45:06.669 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-2 [11/11]: zeebe partitions started in 3567 ms
zeebe_broker_3 | 2020-02-28 09:45:06.670 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-2 succeeded. Started 11 steps in 44302 ms.
But Broker 2 dies with an exception while recovering from its log stream:
zeebe_broker_2 | java.lang.IllegalStateException: Expected to find event with the snapshot position 21475014512 in log stream, but nothing was found. Failed to recover 'Broker-0-StreamProcessor-2'.
zeebe_broker_2 | at io.zeebe.engine.processor.StreamProcessor.recoverFromSnapshot(StreamProcessor.java:209) ~[zeebe-workflow-engine-0.22.1.jar:0.22.1]
zeebe_broker_2 | at io.zeebe.engine.processor.StreamProcessor.onActorStarted(StreamProcessor.java:121) ~[zeebe-workflow-engine-0.22.1.jar:0.22.1]
zeebe_broker_2 | at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:115) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | at io.zeebe.util.sched.ActorThread.run(ActorThread.java:195) [zeebe-util-0.22.1.jar:0.22.1]