How do I performance profile Zeebe for my use case?

Sarath Kumar: Hi Team,
I’m running a Zeebe cluster with 3 brokers and getting this error while doing a load test with 200 requests in parallel: “io.zeebe.client.api.command.ClientStatusException: Reached maximum capacity of requests handled”. Kindly help me find the issue.

Josh Wulf: What’s the issue? You want it to do more?

Sarath Kumar: Yes, we have added 1000 workflows, and 200 instances per second are getting added randomly into these workflows.
This was running fine as long as there was no transition of instances between workflows.

Sarath Kumar: I have a use case where instance 1, while processing, adds an instance to workflow 2. While doing this from the worker, we are getting the mentioned exception

Sarath Kumar: yes,

Sarath Kumar:

    public long createInstance(String bpmnProcessId, Map<String, Object> variables) {
        WorkflowInstanceEvent workflowInstanceEvent = this.zeebeClient.newCreateInstanceCommand()
            .bpmnProcessId(bpmnProcessId).latestVersion().variables(variables)
            .send().join();
        return workflowInstanceEvent.getWorkflowInstanceKey();
    }

Josh Wulf: So, are you saying that “a Worker for a Service Task in Workflow 1 creates an instance of Workflow 2”

Sarath Kumar: yes, correct

Josh Wulf: And how are Workflow 1 instances created?

Sarath Kumar: They are created via separate service

Josh Wulf: So for each workflow 1, you create one instance of workflow 2?

Sarath Kumar: It depends on the use case. Some of the case it will end in workflow 1 itself. But for some use cases, it will create instances in any of the other desired workflow.

Sarath Kumar: We are seeing this exception randomly while creating this new instance from the worker

Josh Wulf: So I hear two different things

Josh Wulf: One is an error message in the worker, appearing randomly, that the max # of parallel requests are in flight

Josh Wulf: The other is that workflows no longer progress on the brokers

Josh Wulf: Is that correct?

Josh Wulf: Or by “there is no transition of instances between workflows” do you mean that workers can no longer create new instances of workflows?

Sarath Kumar: yes, I have added an instance to workflow 1, and workflow 1 is trying to add a new instance to workflow 2. While doing this at a rate of 200 req/sec, we are seeing this error: “io.zeebe.client.api.command.ClientStatusException: Reached maximum capacity of requests handled”

As the worker is throwing an error, the workflow instance moves to a failed state

Josh Wulf: Uh-huh

Josh Wulf: 3 brokers. One gateway? How many workers? What resources do the brokers have? RAM, CPU?

Josh Wulf: Probably the gateway is saturated

Sarath Kumar: yes, 3 brokers with 1 gateway. Let me pull the resources

Josh Wulf: Also, what does zbctl status show you?

Josh Wulf: And what is in the Gateway logs?

Josh Wulf: That message is emitted here: https://github.com/zeebe-io/zeebe/blob/a417a8fee9c93856420ed59ab10f2517a925a081/broker/src/main/java/io/zeebe/broker/transport/commandapi/CommandApiRequestHandler.java#L140

Josh Wulf: That will tell you in the logs what partition:

Josh Wulf: "Partition-{} receiving too many requests. Current limit {} inflight {}, dropping request {} from gateway"

Sarath Kumar: yeah sure, let me pull the logs from gateway

Josh Wulf: @UEHR7VBBP do you need to run at a log level other than INFO to see that message in the logs?

Sarath Kumar:

    2020-02-04 06:28:20.803 [] [main] INFO io.zeebe.gateway - Starting gateway with configuration {
      "network": {
        "host": "10.177.71.93",
        "port": 26500,
        "minKeepAliveInterval": "30s"
      },
      "cluster": {
        "contactPoint": "10.177.71.226:26502",
        "maxMessageSize": "4M",
        "requestTimeout": "15s",
        "clusterName": "zeebe-cluster",
        "memberId": "gateway",
        "host": "0.0.0.0",
        "port": 26502
      },
      "threads": {
        "managementThreads": 1
      },
      "monitoring": {
        "enabled": false,
        "host": "0.0.0.0",
        "port": 9600
      },
      "security": {
        "enabled": false
      }
    }

Sarath Kumar: Above is what’s available in the log. I’m not able to see that partition error message

Josh Wulf: and zbctl status?

Sarath Kumar:

    Cluster size: 3
    Partitions count: 16
    Replication factor: 3
    Brokers:
      Broker 0 - 10.177.71.226:26501
        Partition 1 : Leader
        Partition 2 : Leader
        Partition 3 : Follower
        Partition 4 : Leader
        Partition 5 : Leader
        Partition 6 : Follower
        Partition 7 : Leader
        Partition 8 : Leader
        Partition 9 : Follower
        Partition 10 : Leader
        Partition 11 : Leader
        Partition 12 : Follower
        Partition 13 : Leader
        Partition 14 : Leader
        Partition 15 : Follower
        Partition 16 : Leader
      Broker 1 - 10.177.71.39:26501
        Partition 1 : Follower
        Partition 2 : Follower
        Partition 3 : Follower
        Partition 4 : Follower
        Partition 5 : Follower
        Partition 6 : Follower
        Partition 7 : Follower
        Partition 8 : Follower
        Partition 9 : Follower
        Partition 10 : Follower
        Partition 11 : Follower
        Partition 12 : Follower
        Partition 13 : Follower
        Partition 14 : Follower
        Partition 15 : Follower
        Partition 16 : Follower
      Broker 2 - 10.177.71.131:26501
        Partition 1 : Follower
        Partition 2 : Follower
        Partition 3 : Leader
        Partition 4 : Follower
        Partition 5 : Follower
        Partition 6 : Leader
        Partition 7 : Follower
        Partition 8 : Follower
        Partition 9 : Leader
        Partition 10 : Follower
        Partition 11 : Follower
        Partition 12 : Leader
        Partition 13 : Follower
        Partition 14 : Follower
        Partition 15 : Leader
        Partition 16 : Follower

Josh Wulf: Too many partitions

Josh Wulf: With three brokers you want to have three partitions

Josh Wulf: More partitions == more work

Josh Wulf: Which is cool if you have more brokers, because then they can each get some

Josh Wulf: Broker 1 is not doing any processing

Josh Wulf: two of the nodes are doing all the work

Josh Wulf: and they are having to manage 16 partitions while they do it

Josh Wulf: too much overhead

Josh Wulf: I would start there: 3 partitions, try it again, see where it fails

Josh Wulf: then restart the brokers in TRACE log level and do it again, and study the logs

Josh Wulf: and turn on Prometheus metrics and start building graphs to see where the resource pressure is on each machine and in the system
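For anyone following along, pinning the partition count to three goes in the broker config. A minimal sketch, assuming the 0.2x-era `zeebe.cfg.toml` layout (section and key names are from that era; adjust for your version):

```toml
# zeebe.cfg.toml on each broker -- a sketch, not a complete config
[cluster]
nodeId = 0              # unique per broker: 0, 1, 2
partitionsCount = 3     # one partition per broker, as suggested above
replicationFactor = 3
clusterSize = 3
```

All brokers in the cluster must agree on `partitionsCount`, `replicationFactor`, and `clusterSize`; only `nodeId` differs per broker.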

Sarath Kumar: Just for clarification, will the partition count affect the request count?

Josh Wulf: Also, this: https://zeebe.io/blog/2019/12/zeebe-performance-profiling/

Josh Wulf: It will affect the amount of meta-work that your brokers are doing

Josh Wulf: managing a partition is not servicing requests, so they become resource starved

Sarath Kumar: Yeah sure, I will reduce the partitions, but this will reduce the processing speed, right?
We have already done the Prometheus setup; if possible, share the metrics to measure resource pressure

Josh Wulf: How much? I don’t know. Measure it and let me know - I’m interested to know

Josh Wulf: How do more partitions increase the processing speed?

Josh Wulf: They do if they are spread across more brokers

Josh Wulf: If you had 16 partitions and 16 brokers then of course it will be faster

Josh Wulf: But if you have 3 men, how does giving them 16 jobs get your house built faster?

Josh Wulf: You need 16 people doing 16 jobs for that to be effective

Sarath Kumar: oh okay sure, will try this configuration.
And what’s the max parallel request count the gateway can handle?

Josh Wulf: At the moment you have two people doing 16 jobs, and one - Broker 1, watching them

Josh Wulf: It depends

Josh Wulf: Put it on a Raspberry Pi like @USX4VT2AC and you’ll get one metric

Josh Wulf: Put it on a 16-CPU machine with 256GB RAM and you’ll get another one

Josh Wulf: The only way to know is to measure and adjust parameters

Josh Wulf: That’s what you are doing now - science

Sarath Kumar: yeah sure, we will give these configurations a try

Josh Wulf: devise experiments, take notes, make hypotheses

Josh Wulf: That’s how Mendel discovered genetic inheritance. Careful observation and experimentation

Josh Wulf: And read that blog post - it is exactly about how to do this - performance profiling

Sarath Kumar: Yup sure !!

Josh Wulf: I would run it with partition counts from 3 to 5, replication 0, 2, and 3

Josh Wulf: That gives you a matrix

Josh Wulf: That’s nine combinations

Josh Wulf: Measure throughput, end-to-end latency and mean workflows to failure

Josh Wulf: Then do the same thing again with different sized instances

Josh Wulf: I’d automate that. because if you try two different instance sizes, that’s already 18 runs

Josh Wulf: and if you automate it, you can just as easily do three sizes for 27 runs
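The suggested matrix is easy to generate programmatically. A minimal sketch of enumerating the runs (the benchmark itself is left out; replication factor 1 stands in for "no replication", since Zeebe's minimum replication factor is 1):

```java
import java.util.ArrayList;
import java.util.List;

public class BenchmarkMatrix {
    // Enumerate every combination of cluster parameters to benchmark.
    static List<String> enumerateRuns(int[] partitions, int[] replications, String[] sizes) {
        List<String> runs = new ArrayList<>();
        for (int p : partitions)
            for (int r : replications)
                for (String s : sizes)
                    runs.add("partitions=" + p + " replication=" + r + " size=" + s);
        return runs;
    }

    public static void main(String[] args) {
        int[] partitions = {3, 4, 5};
        int[] replications = {1, 2, 3}; // 1 = no replication
        String[] sizes = {"small", "medium", "large"};
        List<String> runs = enumerateRuns(partitions, replications, sizes);
        System.out.println(runs.size() + " runs"); // prints "27 runs"
    }
}
```

For each entry you would spin up the cluster with those parameters, run the load test, and record throughput, latency, and failures before tearing it down.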

Josh Wulf: And then when a new version comes out, before you deploy it, you run the same test suite again, and BAM! You’re a pro

Josh Wulf: Then you give a talk at a conference and get hired by IBM

Josh Wulf: I saw two guys talk at DevConf.CZ last week

Josh Wulf: They spent six months profiling databases on Cloud providers

Josh Wulf: https://twitter.com/sitapati/status/1221389377340956672

Sarath Kumar: oh wow, well motivated to complete this :sunglasses:

Josh Wulf: https://twitter.com/sitapati/status/1221399049233928193

Josh Wulf: You got this

Josh Wulf: @UTM6C2C3H How do I performance profile Zeebe for my use case?

Note: This post was generated by Slack Archivist from a conversation in the Zeebe Slack, a source of valuable discussions on Zeebe (get an invite). Someone in the Slack thought this was worth sharing!

If this post answered a question for you, hit the Like button - we use that to assess which posts to put into docs.


Shouldn’t the number of partitions match the number of CPU threads divided by the replication factor rather than just the number of brokers?


I think the formula would be

threads = ceil(partitions * (replicationFactor / brokerCount))
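As a sanity check, here is that formula applied to the cluster from this thread (a sketch; `threadsNeeded` is just a name made up for this example):

```java
public class ThreadEstimate {
    // threads = ceil(partitions * (replicationFactor / brokerCount))
    static int threadsNeeded(int partitions, int replicationFactor, int brokerCount) {
        return (int) Math.ceil(partitions * (replicationFactor / (double) brokerCount));
    }

    public static void main(String[] args) {
        // The 16-partition cluster from this thread: every broker touches every partition
        System.out.println(threadsNeeded(16, 3, 3)); // prints 16
        // The suggested 3-partition setup on the same 3 brokers
        System.out.println(threadsNeeded(3, 3, 3));  // prints 3
    }
}
```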

So, for example, you have 5 partitions, 4 brokers, and replication factor 3. The matrix would look like this:

    Partition \ Node    0   1   2   3
           1            L   F   F   -
           2            -   L   F   F
           3            F   -   L   F
           4            F   F   -   L
           5            L   F   F   -

This means a broker takes part in at most 4 partitions, so it needs around 4 CPU and I/O threads to do its work.
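The matrix above can be reproduced with a small simulation. This assumes round-robin placement (partition p's replicas land on replicationFactor consecutive brokers starting at (p - 1) mod brokerCount, the leader first), which matches the example:

```java
public class PartitionPlacement {
    // Count how many partitions each broker participates in, assuming
    // round-robin placement of replicationFactor consecutive replicas.
    static int[] partitionsPerBroker(int partitions, int replicationFactor, int brokers) {
        int[] count = new int[brokers];
        for (int p = 1; p <= partitions; p++)
            for (int r = 0; r < replicationFactor; r++)
                count[(p - 1 + r) % brokers]++;
        return count;
    }

    public static void main(String[] args) {
        int[] count = partitionsPerBroker(5, 3, 4);
        // Brokers 0-2 each touch 4 partitions, broker 3 touches 3 -- max is 4
        for (int b = 0; b < count.length; b++)
            System.out.println("broker " + b + ": " + count[b] + " partitions");
    }
}
```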


Ah, because the stream processor for a partition runs in a single thread.

So then you would want to check whether your brokers are CPU- or I/O-bound to see if they have capacity to handle more partitions?