Cluster dies after chaos test

(I am sure I am doing something wrong, but I have not yet found out what)

I am using the default cluster setup from the zeebe-docker-compose project, with Zeebe 0.22.1, 3 nodes with 2 partitions each.

Running a modified ChaosToolkit chaos test on my local machine (MacBook Pro, macOS 10.15.3 (19D76), 32GB RAM) to randomly kill a broker, I get lots of timeout errors, and now the cluster refuses to respond to deployments:
(the naming is a bit confusing since the docker-compose config defines brokers 1-based, but Zeebe refers to the 0-based nodes it seems)
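For reference, the experiment is a variation of the usual broker-kill setup. A minimal sketch of such an experiment file (the container name, the `zbctl` probe, and all other details here are illustrative assumptions, not my exact file) looks roughly like this:

```json
{
  "version": "1.0.0",
  "title": "Zeebe cluster survives the loss of a broker",
  "description": "Kill one broker container and verify the cluster still responds.",
  "steady-state-hypothesis": {
    "title": "Gateway is reachable",
    "probes": [
      {
        "type": "probe",
        "name": "topology-is-complete",
        "tolerance": 0,
        "provider": {
          "type": "process",
          "path": "zbctl",
          "arguments": ["status", "--insecure"]
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-broker",
      "provider": {
        "type": "process",
        "path": "docker",
        "arguments": ["kill", "zeebe_broker_2"]
      }
    }
  ]
}
```

The `tolerance: 0` on the process probe checks the command's exit code, so the hypothesis fails if `zbctl status` cannot reach the gateway.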

[Broker-1-zb-actors-1] WARN io.zeebe.broker.workflow.repository - Failed to receive deployment response for partitions [2] (topic 'deployment-response-2251799813693058'). Retrying
(repeated ad infinitum)

Broker 1 seems fine:

zeebe_broker_1 | 2020-02-28 09:44:56.692 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - Removing follower partition service for partition PartitionId{id=1, group=raft-partition}
zeebe_broker_1 | 2020-02-28 09:44:56.696 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - Partition role transitioning from null to LEADER
zeebe_broker_1 | 2020-02-28 09:44:56.697 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.broker.system - Installing leader partition service for partition PartitionId{id=1, group=raft-partition}

Broker 3 also seems fine:

zeebe_broker_3 | 2020-02-28 09:45:06.666 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-2 partitions succeeded. Started 2 steps in 3546 ms.
zeebe_broker_3 | 2020-02-28 09:45:06.668 [Broker-2-HealthCheckService] [Broker-2-zb-actors-0] DEBUG io.zeebe.broker.system - All partitions are installed. Broker is ready!
zeebe_broker_3 | 2020-02-28 09:45:06.669 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-2 [11/11]: zeebe partitions started in 3567 ms
zeebe_broker_3 | 2020-02-28 09:45:06.670 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-2 succeeded. Started 11 steps in 44302 ms.

But Broker 2 dies because of a log exception:

zeebe_broker_2 | java.lang.IllegalStateException: Expected to find event with the snapshot position 21475014512 in log stream, but nothing was found. Failed to recover 'Broker-0-StreamProcessor-2'.
zeebe_broker_2 | 	at io.zeebe.engine.processor.StreamProcessor.recoverFromSnapshot(StreamProcessor.java:209) ~[zeebe-workflow-engine-0.22.1.jar:0.22.1]
zeebe_broker_2 | 	at io.zeebe.engine.processor.StreamProcessor.onActorStarted(StreamProcessor.java:121) ~[zeebe-workflow-engine-0.22.1.jar:0.22.1]
zeebe_broker_2 | 	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | 	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | 	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:115) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | 	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | 	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.22.1.jar:0.22.1]
zeebe_broker_2 | 	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:195) [zeebe-util-0.22.1.jar:0.22.1]

Hey @grexe,

Cool that you started the first experiments :smiley:
Could you share your experiment JSON file?
Do you still have the related data directory available? Could you share it?
We are currently investigating some bugs related to inconsistent logs; this could be one of those issues.

Greets
Chris

Thanks for the super-quick-as-usual response! Should I open a GitHub issue and attach the files there, or do you want to discuss the issue here a bit further first?
Update: I will use GitHub, as I cannot attach non-image files here (at least it’s not intended for that).

Hey @grexe

I think it might be related to https://github.com/zeebe-io/zeebe/issues/3920

But to make sure, I would like to check your log to see whether it is inconsistent or not.
If so, it makes sense to add a comment to that issue describing how to reproduce it, maybe via your chaos experiment. This would help us a lot :rocket:

Greets
Chris


Done, added my observations with 0.22.1 and 0.23.0-alpha1 to the issue.
