Battle testing Zeebe cluster

grexe_1 · February 25, 2020, 2:47pm

Are there any facilities, tips, guides, best practices etc. for battle testing a Zeebe cluster?
I am evaluating Zeebe for a new project and we want to ensure proper operation in the case of nodes dying, network connectivity issues, etc., wrt workflow execution.
Something like Netflix’s Chaos Monkey project would make a great extension for Zeebe…

Zelldon · February 25, 2020, 3:03pm

Hey @grexe

We are running chaos experiments via the Chaos Toolkit. I want to open source them in the near future. Furthermore, I plan, as a side project, to build an extension for the Chaos Toolkit so that it is easier to write chaos experiments for Zeebe. Is that something you would use?

Greets
Chris

grexe_1 · February 25, 2020, 3:15pm

This would be awesome @Zelldon! Great to see work in this area, also just discovered and really liked the Chaos Toolkit! Looks like Selenium for chaos testing…
I’d be very happy to use a preliminary version for our own tests here, and provide feedback or even PR’s.

salaboy · February 25, 2020, 3:24pm

@grexe that might be a great community contribution for testing the Helm charts as well… I think that chaos testing makes a lot of sense if we are running Zeebe in an orchestrator that can help Zeebe to heal. @Zelldon we should probably think about open sourcing that as well… I will be working on adding testing pipelines for the charts, so it feels to me that there is a perfect opportunity to glue all those things together.

Zelldon · February 25, 2020, 3:39pm

Great to see work in this area, also just discovered and really liked the Chaos Toolkit!

Yes, I’m also a fan of it. If you are interested in this topic, I can recommend the book “Learning Chaos Engineering” by Russ Miles. It is written by one of the creators of the toolkit and includes many examples of how to use it.

I’d be very happy to use a preliminary version for our own tests here, and provide feedback or even PR’s.

That would be nice

This was the plan. Sounds good, @salaboy!

I will post an update here once I have published the first experiments.

Greets
Chris

grexe_1 · February 25, 2020, 4:04pm

Thanks, I’ve started reading it already, since we have a Safaribooks company account here:)

Looking forward to your contribution and to helping battle test it:)

Zelldon · February 28, 2020, 1:34pm

Hey @grexe

As promised, I have published a first version here: zeebe-io/zeebe-chaos on GitHub. The repository contains everything related to chaos engineering in Zeebe: chaos experiments, the hypothesis backlog, etc.

Happy to hear your feedback

Greets
Chris

grexe_1 · February 28, 2020, 2:01pm

nice, already discovered the stub before:)
Should I add my configs/experiments for zeebe-docker-compose there? (e.g., I see your scripts assume Kubernetes installed)
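
For readers unfamiliar with the format, a Chaos Toolkit experiment aimed at a docker-compose setup rather than Kubernetes could be sketched roughly as below. This is only an illustration and is not taken from the zeebe-chaos repository: the container name, the probe and action names, the 10-second pause and the helper function build_experiment are all assumptions. It uses the Chaos Toolkit "process" provider to call zbctl and docker, and treats a zero exit code of zbctl status as the steady state.

```python
import json


def build_experiment(container: str, address: str = "127.0.0.1:26500") -> dict:
    """Build a Chaos Toolkit experiment that kills one broker container.

    `container` is a hypothetical docker container name; adjust it to
    whatever `docker ps` shows for your compose project.
    """
    # Steady state: `zbctl status` exits with code 0 (tolerance 0).
    status_probe = {
        "type": "probe",
        "name": "cluster-answers-status-request",
        "tolerance": 0,
        "provider": {
            "type": "process",
            "path": "zbctl",
            "arguments": f"--insecure --address {address} status",
        },
    }
    return {
        "version": "1.0.0",
        "title": "Zeebe stays reachable when one broker is killed",
        "description": "Kill a single broker container and check the cluster status.",
        "steady-state-hypothesis": {
            "title": "Cluster answers status requests",
            "probes": [status_probe],
        },
        # The chaos itself: kill the container, then wait before re-checking.
        "method": [
            {
                "type": "action",
                "name": "kill-broker-container",
                "provider": {
                    "type": "process",
                    "path": "docker",
                    "arguments": f"kill {container}",
                },
                "pauses": {"after": 10},  # assumed settle time in seconds
            }
        ],
        # Bring the container back so later runs start from a full cluster.
        "rollbacks": [
            {
                "type": "action",
                "name": "restart-broker-container",
                "provider": {
                    "type": "process",
                    "path": "docker",
                    "arguments": f"start {container}",
                },
            }
        ],
    }


if __name__ == "__main__":
    # Write the result to experiment.json and run it with: chaos run experiment.json
    print(json.dumps(build_experiment("zeebe_broker_1"), indent=2))
```

Generating the JSON from a small function makes it easy to produce one experiment per broker container. Note that if the killed container also hosts the gateway that zbctl talks to, the status probe will fail even though the remaining brokers may be healthy.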

grexe_1 · March 5, 2020, 4:33pm

I did some more chaos testing on my local docker-compose setup (the default setup from the zeebe-docker-compose project) and got mixed results. Note that I’m using this to track and test Zeebe issue #3920.
You can follow and review my PR here: https://github.com/zeebe-io/zeebe-chaos/pull/1

After fixing an error in my test, the containers come up fine and the test runs through reproducibly, but it’s a bit fishy. For example, during my last run, I got this:

zeebe_broker_1 | 2020-03-05 16:26:00.920 [] [raft-server-0-raft-partition-partition-1-heartbeat] ERROR io.atomix.utils.concurrent.SingleThreadContext - An uncaught exception occurred
zeebe_broker_1 | java.lang.IllegalStateException: not on a Catalyst thread
zeebe_broker_1 | 	at com.google.common.base.Preconditions.checkState(Preconditions.java:508) ~[guava-28.1-jre.jar:?]
zeebe_broker_1 | 	at io.atomix.utils.concurrent.ThreadContext.checkThread(ThreadContext.java:71) ~[atomix-utils-3.2.0-alpha10.jar:?]
zeebe_broker_1 | 	at io.atomix.protocols.raft.impl.RaftContext.checkThread(RaftContext.java:550) ~[atomix-raft-3.2.0-alpha10.jar:?]
zeebe_broker_1 | 	at io.atomix.protocols.raft.impl.RaftContext.transition(RaftContext.java:521) ~[atomix-raft-3.2.0-alpha10.jar:?]
zeebe_broker_1 | 	at io.atomix.protocols.raft.roles.ActiveRole.onLeaderHeartbeat(ActiveRole.java:250) ~[atomix-raft-3.2.0-alpha10.jar:?]
zeebe_broker_1 | 	at io.atomix.protocols.raft.impl.RaftContext.lambda$null$30(RaftContext.java:230) ~[atomix-raft-3.2.0-alpha10.jar:?]
zeebe_broker_1 | 	at io.atomix.utils.concurrent.SingleThreadContext$1.lambda$execute$0(SingleThreadContext.java:53) ~[atomix-utils-3.2.0-alpha10.jar:?]
zeebe_broker_1 | 	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
zeebe_broker_1 | 	at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
zeebe_broker_1 | 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
zeebe_broker_1 | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
zeebe_broker_1 | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
zeebe_broker_1 | 	at java.lang.Thread.run(Unknown Source) [?:?]

The broker came up fine after all, but my setup was not ideal (broker 2 being leader for both partitions):

❯ zbctl --insecure --address 127.0.0.1:26500 status
Cluster size: 3
Partitions count: 2
Replication factor: 3
Gateway version: unavailable
Brokers:
  Broker 0 - 172.25.0.2:26501
    Version: unavailable
    Partition 1 : Follower
    Partition 2 : Follower
  Broker 1 - 172.25.0.3:26501
    Version: unavailable
    Partition 1 : Follower
    Partition 2 : Follower
  Broker 2 - 172.25.0.4:26501
    Version: unavailable
    Partition 1 : Leader
    Partition 2 : Leader

Also, when killing the “wrong” node, the gateway dies with it, and zbctl cannot reach the cluster anymore:

transport: Error while dialing dial tcp 127.0.0.1:26500: connect: connection refused

The cluster is still up and running fine though (remaining 2 nodes).

Should I be worried about the Exception, and is the 2-leader-setup a known issue?

jwulf · March 6, 2020, 3:05am

Having one broker leading more than one partition, in itself, is not an issue. It’s an artifact of an unmanaged cluster.

The team (fairly recently) did some work to distribute leadership more evenly across the cluster, but it is not trivial to distribute leadership in a distributed cluster with no cluster manager. You can imagine various election strategies, but ain’t nobody got time for massive election cycles - it takes away time from getting actual work done.
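
To keep an eye on how leadership is spread during such tests, one option is to tally the Leader roles per broker from the text that zbctl status prints. The sketch below assumes the output layout shown earlier in this thread; the function name leaders_per_broker and the SAMPLE constant are made up for illustration, and other zbctl versions may format the output differently.

```python
import re
from collections import Counter

# Abbreviated copy of the `zbctl status` output posted above.
SAMPLE = """\
Cluster size: 3
Partitions count: 2
Replication factor: 3
Brokers:
  Broker 0 - 172.25.0.2:26501
    Partition 1 : Follower
    Partition 2 : Follower
  Broker 1 - 172.25.0.3:26501
    Partition 1 : Follower
    Partition 2 : Follower
  Broker 2 - 172.25.0.4:26501
    Partition 1 : Leader
    Partition 2 : Leader
"""


def leaders_per_broker(status_text: str) -> Counter:
    """Count how many partitions each broker leads."""
    counts = Counter()
    current = None  # id of the broker whose partitions are being read
    for line in status_text.splitlines():
        broker = re.match(r"\s*Broker (\d+) - ", line)
        if broker:
            current = int(broker.group(1))
            counts[current] += 0  # register brokers that lead nothing
            continue
        if current is not None and re.match(r"\s*Partition \d+ : Leader\s*$", line):
            counts[current] += 1
    return counts


if __name__ == "__main__":
    for broker_id, led in sorted(leaders_per_broker(SAMPLE).items()):
        print(f"Broker {broker_id} leads {led} partition(s)")
```

Running this before and after killing a node shows whether leadership moved and to which broker.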

I would be concerned about the Exception. Exception messages are literally a statement that something is wrong. The code has taken a path that was not revealed in testing and did not factor into the engineer’s reasoning about its behaviour at run-time.

I can’t find anything in the known issues about this error. I would open a bug report for it.

Zelldon · March 6, 2020, 4:20am

Hey guys,

The exception can happen during a transition and is not a real problem. We are aware of it but have not had time to fix it yet: https://github.com/zeebe-io/atomix/issues/64

Greets
Chris
