I am seeing serious performance problems and message timeouts with Zeebe. Below is a detailed description.

I'm currently running a performance stress test against Zeebe.
My machine setup is as follows: three gateways and five brokers, each service in its own virtual machine, for a total of eight virtual machines.

| Role | vCPUs | Platform | Memory (GiB) |
|---|---|---|---|
| Gateway (×3) | 4 | 64-bit | 8 |
| Broker (×5) | 8 | 64-bit | 32 |

My BPMN process is as follows:


I started 100 client threads and created 10,000 instances. Hundreds of messages timed out, and an instance takes up to 4 minutes to complete.
I did not enable the back-pressure mechanism. CPU usage on the machines is only around 20%. So why are messages timing out, and why do instances take so long to complete?
These results are very inconsistent with the numbers in this post: https://zeebe.io/blog/2018/06/benchmarking-zeebe-horizontal-scaling/

I am quite worried, because we plan to use Zeebe in a real production environment, and the current performance issues are concerning.

Which version of Zeebe are you using? If the numbers don't match, there are several possibilities: different configurations, or limits in how you start the brokers inside the virtual machines. If the brokers are not using more of the available CPU, something is limiting the running processes.
Can you share the parameters that you are using in the brokers and the gateway?

@wenminglei if you are planning to use this for a production scenario, have you considered getting support from Camunda?


Hello, I am using version 0.23.1. The gateway configuration is as follows:

```json
{
  "network" : {
    "host" : "0.0.0.0",
    "port" : 26500,
    "minKeepAliveInterval" : "PT15S"
  },
  "cluster" : {
    "contactPoint" : "10.151.30.xxx:26502",
    "requestTimeout" : "PT30S",
    "clusterName" : "zeebe-cluster",
    "memberId" : "gateway",
    "host" : "0.0.0.0",
    "port" : 26502
  },
  "threads" : {
    "managementThreads" : 1
  },
  "monitoring" : {
    "enabled" : false,
    "host" : "0.0.0.0",
    "port" : 9600
  },
  "security" : {
    "enabled" : false,
    "certificateChainPath" : null,
    "privateKeyPath" : null
  }
}
```
The broker configuration is as follows:

```json
{
  "network" : {
    "host" : "10.151.30.xxx",
    "portOffset" : 0,
    "maxMessageSize" : "4MB",
    "advertisedHost" : "10.151.30.xxx",
    "commandApi" : {
      "host" : "10.151.30.xxx",
      "port" : 26501,
      "advertisedHost" : "10.151.30.xxx",
      "advertisedPort" : 26501,
      "address" : "10.151.30.xxx:26501",
      "advertisedAddress" : "10.151.30.xxx:26501"
    },
    "internalApi" : {
      "host" : "10.151.30.xxx",
      "port" : 26502,
      "advertisedHost" : "10.151.30.xxx",
      "advertisedPort" : 26502,
      "address" : "10.151.30.xxx:26502",
      "advertisedAddress" : "10.151.30.xxx:26502"
    },
    "monitoringApi" : {
      "host" : "10.151.30.xxx",
      "port" : 9600,
      "advertisedHost" : "10.151.30.xxx",
      "advertisedPort" : 9600,
      "address" : "10.151.30.xxx:9600",
      "advertisedAddress" : "10.151.30.xxx:9600"
    },
    "maxMessageSizeInBytes" : 4194304
  },
  "cluster" : {
    "initialContactPoints" : [ "10.151.30.xxx:26502", "10.151.30.xxx:26502", "10.151.30.xxx:26502", "10.151.30.xxx:26502", "10.151.30.xxx:26502" ],
    "partitionIds" : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ],
    "nodeId" : 0,
    "partitionsCount" : 10,
    "replicationFactor" : 1,
    "clusterSize" : 5,
    "clusterName" : "zeebe-cluster",
    "gossipFailureTimeout" : 10000,
    "gossipInterval" : 250,
    "gossipProbeInterval" : 1000
  },
  "threads" : {
    "cpuThreadCount" : 4,
    "ioThreadCount" : 4
  },
  "data" : {
    "directories" : [ "/apps/zeebe-broker-0.23.1/data" ],
    "logSegmentSize" : "512MB",
    "snapshotPeriod" : "PT10M",
    "logIndexDensity" : 100,
    "logSegmentSizeInBytes" : 536870912
  },
  "exporters" : {
    "elasticsearch" : {
      "jarPath" : null,
      "className" : "io.zeebe.exporter.ElasticsearchExporter",
      "args" : {
        "url" : "http://10.151.30.xxx:9200",
        "bulk" : {
          "delay" : 5,
          "size" : 1000
        },
        "index" : {
          "prefix" : "zeebe-record",
          "createTemplate" : true,
          "command" : false,
          "event" : true,
          "rejection" : false,
          "deployment" : true,
          "error" : true,
          "incident" : true,
          "job" : true,
          "jobBatch" : false,
          "message" : false,
          "messageSubscription" : false,
          "variable" : true,
          "variableDocument" : true,
          "workflowInstance" : true,
          "workflowInstanceCreation" : false,
          "workflowInstanceSubscription" : false
        }
      },
      "external" : false
    }
  },
  "gateway" : {
    "network" : {
      "host" : "10.151.30.130",
      "port" : 26500,
      "minKeepAliveInterval" : "PT30S"
    },
    "cluster" : {
      "contactPoint" : "10.151.30.130:26502",
      "requestTimeout" : "PT15S",
      "clusterName" : "zeebe-cluster",
      "memberId" : "gateway",
      "host" : "0.0.0.0",
      "port" : 26502
    },
    "threads" : {
      "managementThreads" : 1
    },
    "monitoring" : {
      "enabled" : false,
      "host" : "10.151.30.130",
      "port" : 9600
    },
    "security" : {
      "enabled" : false,
      "certificateChainPath" : null,
      "privateKeyPath" : null
    },
    "enable" : false
  },
  "backpressure" : {
    "enabled" : false,
    "algorithm" : "VEGAS"
  },
  "stepTimeout" : "PT5M"
}
```

I executed a process with no business logic, six task nodes in total. A single process instance takes five minutes to complete.

This is clearly unacceptable. So please tell me: what should I do?

My application runs on ten nodes.
The client configuration is as follows:

```properties
zeebe.client.broker.contactPoint=common.zeebe.yp:26500
zeebe.client.security.plaintext=true
zeebe.client.worker.defaultName=worker-name
zeebe.client.worker.max-jobs-active=100
zeebe.client.job.poll-interval=100
zeebe.client.job.timeout=10
```

```java
for (int i = 0; i < 100; i++) {
    for (int a = 0; a < 100; a++) {
        client.newCreateInstanceCommand()
            .bpmnProcessId("demoProcess")
            .latestVersion()
            .variables("{\"a\": \"" + UUID.randomUUID().toString() + "\"}")
            .send();
    }
}
```

I ran it five times this way, and the gateway reported the following error:

```
2020-05-15 15:53:49.033 [io.zeebe.gateway.impl.broker.BrokerRequestManager] [gateway-scheduler-zb-actors-0] ERROR io.zeebe.gateway - Error handling gRPC request
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: Time out between gateway and broker: Request type command-api-5 timed out in 30000 milliseconds
	at io.grpc.Status.asRuntimeException(Status.java:524) ~[grpc-api-1.28.1.jar:1.28.1]
	at io.zeebe.gateway.EndpointManager.convertThrowable(EndpointManager.java:397) ~[zeebe-gateway-0.23.1.jar:0.23.1]
	at io.zeebe.gateway.impl.job.ActivateJobsHandler.logErrorResponse(ActivateJobsHandler.java:109) ~[zeebe-gateway-0.23.1.jar:0.23.1]
	at io.zeebe.gateway.impl.job.ActivateJobsHandler.lambda$activateJobs$1(ActivateJobsHandler.java:96) ~[zeebe-gateway-0.23.1.jar:0.23.1]
	at io.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequest$3(BrokerRequestManager.java:148) ~[zeebe-gateway-0.23.1.jar:0.23.1]
	at io.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequestInternal$5(BrokerRequestManager.java:191) ~[zeebe-gateway-0.23.1.jar:0.23.1]
	at io.zeebe.util.sched.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:33) [zeebe-util-0.23.1.jar:0.23.1]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:76) [zeebe-util-0.23.1.jar:0.23.1]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.23.1.jar:0.23.1]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:115) [zeebe-util-0.23.1.jar:0.23.1]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.23.1.jar:0.23.1]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.23.1.jar:0.23.1]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:195) [zeebe-util-0.23.1.jar:0.23.1]
Caused by: java.util.concurrent.TimeoutException: Request type command-api-5 timed out in 30000 milliseconds
	at io.atomix.cluster.messaging.impl.AbstractClientConnection$Callback.timeout(AbstractClientConnection.java:163) ~[atomix-cluster-0.23.1.jar:0.23.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) ~[?:?]
```

But the CPU is far from being a bottleneck, and afterwards the gateway and broker can no longer connect unless the gateway is restarted.

@wenminglei it would be really helpful for identifying the problem if you pushed your code to a GitHub repository that we can run.

Can you give me some suggestions about my configuration? Is anything unreasonable in my gateway, broker, or client configuration?

Hey @wenminglei

Before you ask, it is always wise to answer the following basic questions in one go; otherwise we probably can't reproduce your issue or help solve it.

  • Which client do you use? Which version?
  • Which Zeebe version are you referring to?
  • What exactly is the scenario you want to achieve? What is the use case?
  • What is your expected behavior?
  • What exactly is the actual behavior?
  • What have you tried so far to solve this issue?
  • What does your configuration look like?

I see that some of these have already been answered. I will try to work with those details.

> I started 100 threads and started 10,000 instances.

Where do you start 100 threads? For what and why?

```json
"threads" : {
  "managementThreads" : 1
},
```

Your gateway configuration shows that the gateway uses only one management thread; you should increase it to handle more traffic. Otherwise you will hit the same issue as described in #4524.
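As a sketch, the standalone gateway's thread count can be raised in its YAML config. The value 4 below is only an assumption to match the gateway VM's 4 vCPUs, not a recommendation; tune it for your actual load:

```yaml
# Sketch of a 0.23-style gateway setting; 4 is an assumed value,
# chosen here only to match the 4 vCPUs of the gateway VM.
zeebe:
  gateway:
    threads:
      managementThreads: 4
```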

> I did not enable the back pressure mechanism.

What do you mean by that? It should be enabled by default.

> This situation is very inconsistent with the data of this link: https://zeebe.io/blog/2018/06/benchmarking-zeebe-horizontal-scaling/

Be aware that this blog post is two years old. Furthermore, it refers only to workflow instances *created*, not *completed*!

> I executed a process without business processing, a total of six working nodes. It takes five minutes to complete the execution of a process.

As you might know, Zeebe uses a persistent log to store all of its data; new events/commands are appended to the end of the log. If a lot of traffic is coming in, e.g. many instances being created, this also affects the overall latency to complete a workflow instance. It further depends on the execution time of the workers, of course, and on the latency between gateway and broker. As we saw, you use one management thread on the gateway, which might be a bottleneck here.
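A back-of-the-envelope queueing estimate (all numbers hypothetical) shows why latency can balloon even while the CPU is mostly idle: once commands queue on the log faster than the partitions drain them, the wait for the last command is roughly the queue length divided by the throughput.

```java
public class QueueEstimate {

    // Rough wait-time estimate: commands already queued ahead of you,
    // divided by the rate at which the cluster drains them.
    static double waitSeconds(double queuedCommands, double throughputPerSecond) {
        return queuedCommands / throughputPerSecond;
    }

    public static void main(String[] args) {
        // Hypothetical: 10,000 instances created at once, assumed 200 commands/s.
        System.out.println(waitSeconds(10_000, 200)); // prints 50.0
    }
}
```

With these assumed numbers, the last instance would sit in the queue for almost a minute before it even starts executing, regardless of CPU headroom.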

It would be awesome if you could use markdown formatting, so that your posted code and configuration are easier for us to read.

```java
client.newCreateInstanceCommand()
    .bpmnProcessId("demoProcess")
    .latestVersion()
    .variables("{\"a\": \"" + UUID.randomUUID().toString() + "\"}")
    .send();
```

So you're creating 10,000 instances asynchronously in one go; you're hitting the gateway and broker with a sledgehammer. I'm wondering: what is the expected load? Do you have a target load you're aiming for?
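One way to avoid that is to await each batch of commands before sending the next, which bounds the number of requests in flight at the gateway. A minimal sketch (the class, method, and `send` supplier are hypothetical stand-ins for the real client calls):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class BatchedStarter {

    // Sends `total` commands, but blocks after every `batchSize` of them
    // until that whole batch has completed, so at most `batchSize`
    // requests are ever in flight. Returns the number of commands sent.
    public static int startInBatches(int total, int batchSize,
                                     Supplier<CompletableFuture<?>> send) {
        List<CompletableFuture<?>> inFlight = new ArrayList<>();
        int started = 0;
        for (int i = 0; i < total; i++) {
            inFlight.add(send.get());
            started++;
            if (inFlight.size() == batchSize) {
                // Wait for the whole batch before queueing more work.
                CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
                inFlight.clear();
            }
        }
        if (!inFlight.isEmpty()) {
            CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
        }
        return started;
    }
}
```

With the real client you would pass a supplier that wraps the command's future as a `CompletableFuture` (assuming the client's future type can be adapted that way), and tune `batchSize` toward the throughput you actually need rather than the maximum you can generate.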
