Broker fails to start after increasing partitions

Hi,
I have installed a Zeebe cluster in Kubernetes using Helm charts. The details are as follows:

Kubernetes client and server: v1.15.3
Platform: GNU/Linux
Helm version: v2.14.3

Zeebe image: camunda/zeebe:0.25.0
Other env variables:

- name: ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT
  value: "3"
- name: ZEEBE_BROKER_CLUSTER_CLUSTERSIZE
  value: "3"
- name: ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR
  value: "2"
- name: ZEEBE_BROKER_THREADS_CPUTHREADCOUNT
  value: "2"
- name: ZEEBE_BROKER_THREADS_IOTHREADCOUNT
  value: "2"

Exporters configured: Kafka and Hazelcast. Elasticsearch is disabled.
Workers configured: http-worker

This all works well when the partition count is set to 3 (the same as the cluster size). But as soon as I raise the partition count above the cluster size, the brokers do not come up. For example, I made the following changes to the env variables:

- name: ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT
  value: "7"
- name: ZEEBE_BROKER_THREADS_CPUTHREADCOUNT
  value: "7"

(I am running this on an 8-core machine.)
After this, the brokers do not come up and, of course, the gateway fails to recognize them.
There are no errors in the logs, though there are warnings from raft-server-partitions. There is enough disk space and volume.

The following is the first part of the logs from broker-0.

2021-01-25 09:12:30.530 [] [main] INFO io.zeebe.broker.StandaloneBroker - Starting StandaloneBroker v0.25.0 on zeebe-broker-0 with PID 6 (/usr/local/zeebe/lib/zeebe-distribution-0.25.0.jar started by root in /usr/local/zeebe)
2021-01-25 09:12:30.591 [] [main] DEBUG io.zeebe.broker.StandaloneBroker - Running with Spring Boot v2.3.4.RELEASE, Spring v5.2.9.RELEASE
2021-01-25 09:12:30.592 [] [main] INFO io.zeebe.broker.StandaloneBroker - No active profile set, falling back to default profiles: default
2021-01-25 09:12:34.298 [] [main] INFO org.springframework.boot.web.embedded.tomcat.TomcatWebServer - Tomcat initialized with port(s): 9600 (http)
2021-01-25 09:12:34.309 [] [main] INFO org.apache.coyote.http11.Http11NioProtocol - Initializing ProtocolHandler [“http-nio-0.0.0.0-9600”]
2021-01-25 09:12:34.310 [] [main] INFO org.apache.catalina.core.StandardService - Starting service [Tomcat]
2021-01-25 09:12:34.310 [] [main] INFO org.apache.catalina.core.StandardEngine - Starting Servlet engine: [Apache Tomcat/9.0.39]
2021-01-25 09:12:34.516 [] [main] INFO org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/] - Initializing Spring embedded WebApplicationContext
2021-01-25 09:12:34.517 [] [main] INFO org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext - Root WebApplicationContext: initialization completed in 3821 ms
2021-01-25 09:12:35.613 [] [main] INFO org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor - Initializing ExecutorService ‘applicationTaskExecutor’
2021-01-25 09:12:36.422 [] [main] INFO org.springframework.boot.actuate.endpoint.web.EndpointLinksResolver - Exposing 4 endpoint(s) beneath base path ‘/actuator’
2021-01-25 09:12:36.501 [] [main] INFO org.apache.coyote.http11.Http11NioProtocol - Starting ProtocolHandler [“http-nio-0.0.0.0-9600”]
2021-01-25 09:12:36.523 [] [main] INFO org.springframework.boot.web.embedded.tomcat.TomcatWebServer - Tomcat started on port(s): 9600 (http) with context path ‘’
2021-01-25 09:12:36.601 [] [main] INFO io.zeebe.broker.StandaloneBroker - Started StandaloneBroker in 7.099 seconds (JVM running for 10.052)
2021-01-25 09:12:36.626 [] [main] DEBUG io.zeebe.broker.system - Initializing system with base path /usr/local/zeebe
2021-01-25 09:12:36.713 [] [main] INFO io.zeebe.broker.system - Version: 0.25.0
2021-01-25 09:12:36.823 [] [main] INFO io.zeebe.broker.system - Starting broker 0 with configuration {
“network” : {
“host” : “0.0.0.0”,
“portOffset” : 0,
“maxMessageSize” : “4MB”,
“advertisedHost” : “zeebe-broker-0.zeebe-broker.default.svc.cluster.local”,
“commandApi” : {
“host” : “0.0.0.0”,
“port” : 26501,
“advertisedHost” : “zeebe-broker-0.zeebe-broker.default.svc.cluster.local”,
“advertisedPort” : 26501,
“advertisedAddress” : “zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26501”,
“address” : “0.0.0.0:26501”
},
“internalApi” : {
“host” : “0.0.0.0”,
“port” : 26502,
“advertisedHost” : “zeebe-broker-0.zeebe-broker.default.svc.cluster.local”,
“advertisedPort” : 26502,
“advertisedAddress” : “zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26502”,
“address” : “0.0.0.0:26502”
},
“monitoringApi” : {
“host” : “0.0.0.0”,
“port” : 9600,
“advertisedHost” : “zeebe-broker-0.zeebe-broker.default.svc.cluster.local”,
“advertisedPort” : 9600,
“advertisedAddress” : “zeebe-broker-0.zeebe-broker.default.svc.cluster.local:9600”,
“address” : “0.0.0.0:9600”
},
“maxMessageSizeInBytes” : 4194304
},
“cluster” : {
“initialContactPoints” : [ “zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26502”, “zeebe-broker-1.zeebe-broker.default.svc.cluster.local:26502”, “zeebe-broker-2.zeebe-broker.default.svc.cluster.local:26502” ],
“partitionIds” : [ 1, 2, 3, 4, 5, 6, 7 ],
“nodeId” : 0,
“partitionsCount” : 7,
“replicationFactor” : 2,
“clusterSize” : 3,
“clusterName” : “zeebe-broker”,
“membership” : {
“broadcastUpdates” : false,
“broadcastDisputes” : true,
“notifySuspect” : false,
“gossipInterval” : “PT0.25S”,
“gossipFanout” : 2,
“probeInterval” : “PT1S”,
“probeTimeout” : “PT2S”,
“suspectProbes” : 3,
“failureTimeout” : “PT10S”,
“syncInterval” : “PT10S”
}
},
“threads” : {
“cpuThreadCount” : 7,
“ioThreadCount” : 2
},
“data” : {
“directories” : [ “/usr/local/zeebe/data” ],
“logSegmentSize” : “512MB”,
“snapshotPeriod” : “PT15M”,
“logIndexDensity” : 100,
“diskUsageMonitoringEnabled” : true,
“diskUsageReplicationWatermark” : 0.99,
“diskUsageCommandWatermark” : 0.97,
“diskUsageMonitoringInterval” : “PT1S”,
“rocksdb” : {
“columnFamilyOptions” : { }
},
“atomixStorageLevel” : “DISK”,
“freeDiskSpaceCommandWatermark” : 3220879442,
“freeDiskSpaceReplicationWatermark” : 1073626481,
“logSegmentSizeInBytes” : 536870912
},
“exporters” : {
……
}

“gateway” : {
“network” : {
“host” : “0.0.0.0”,
“port” : 26500,
“minKeepAliveInterval” : “PT30S”
},
“cluster” : {
“contactPoint” : “0.0.0.0:26502”,
“requestTimeout” : “PT15S”,
“clusterName” : “zeebe-cluster”,
“memberId” : “gateway”,
“host” : “0.0.0.0”,
“port” : 26502,
“membership” : {
“broadcastUpdates” : false,
“broadcastDisputes” : true,
“notifySuspect” : false,
“gossipInterval” : “PT0.25S”,
“gossipFanout” : 2,
“probeInterval” : “PT1S”,
“probeTimeout” : “PT2S”,
“suspectProbes” : 3,
“failureTimeout” : “PT10S”,
“syncInterval” : “PT10S”
}
},
“threads” : {
“managementThreads” : 1
},
“monitoring” : {
“enabled” : false,
“host” : “0.0.0.0”,
“port” : 9600
},
“security” : {
“enabled” : false,
“certificateChainPath” : null,
“privateKeyPath” : null
},
“longPolling” : {
“enabled” : true
},
“initialized” : true,
“enable” : false
},
“backpressure” : {
“enabled” : true,
“algorithm” : “VEGAS”,
“aimd” : {
“requestTimeout” : “PT1S”,
“initialLimit” : 100,
“minLimit” : 1,
“maxLimit” : 1000,
“backoffRatio” : 0.9
},
“fixed” : {
“limit” : 20
},
“vegas” : {
“alpha” : 3,
“beta” : 6,
“initialLimit” : 20
},
“gradient” : {
“minLimit” : 10,
“initialLimit” : 20,
“rttTolerance” : 2.0
},
“gradient2” : {
“minLimit” : 10,
“initialLimit” : 20,
“rttTolerance” : 2.0,
“longWindow” : 600
}
},
“experimental” : {
“maxAppendsPerFollower” : 2,
“maxAppendBatchSize” : “0MB”,
“disableExplicitRaftFlush” : false,
“maxAppendBatchSizeInBytes” : 32768
},
“stepTimeout” : “PT5M”,
“executionMetricsExporterEnabled” : false
}

2021-01-26 04:34:04.385 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [1/13]: actor scheduler
2021-01-26 04:34:04.422 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [1/13]: actor scheduler started in 36 ms
2021-01-26 04:34:04.423 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [2/13]: membership and replication protocol
2021-01-26 04:34:04.490 [] [main] DEBUG io.zeebe.broker.clustering - Member 0 will contact node: zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26502
2021-01-26 04:34:04.497 [] [main] DEBUG io.zeebe.broker.clustering - Member 0 will contact node: zeebe-broker-1.zeebe-broker.default.svc.cluster.local:26502
2021-01-26 04:34:04.500 [] [main] DEBUG io.zeebe.broker.clustering - Member 0 will contact node: zeebe-broker-2.zeebe-broker.default.svc.cluster.local:26502
2021-01-26 04:34:04.985 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [2/13]: membership and replication protocol started in 562 ms
2021-01-26 04:34:04.986 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [3/13]: command api transport
2021-01-26 04:34:05.391 [] [main] DEBUG io.zeebe.broker.system - Bound command API to zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26501
2021-01-26 04:34:05.402 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [3/13]: command api transport started in 416 ms
2021-01-26 04:34:05.402 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [4/13]: command api handler
2021-01-26 04:34:05.505 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [4/13]: command api handler started in 103 ms
2021-01-26 04:34:05.505 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [5/13]: subscription api
2021-01-26 04:34:05.594 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [5/13]: subscription api started in 88 ms
2021-01-26 04:34:05.594 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-0 [6/13]: cluster services
2021-01-26 04:34:08.379 [] [http-nio-0.0.0.0-9600-exec-1] INFO org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/] - Initializing Spring DispatcherServlet ‘dispatcherServlet’
2021-01-26 04:34:08.379 [] [http-nio-0.0.0.0-9600-exec-1] INFO org.springframework.web.servlet.DispatcherServlet - Initializing Servlet ‘dispatcherServlet’
2021-01-26 04:34:08.388 [] [http-nio-0.0.0.0-9600-exec-1] INFO org.springframework.web.servlet.DispatcherServlet - Completed initialization in 9 ms

After this point, warnings from the Raft server start and the broker never catches up. The warnings look like this:

2021-01-26 04:34:10.882 [] [raft-server-0-raft-partition-partition-7] WARN io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 1 failed: java.net.ConnectException: Expected to send a message with subject ‘raft-partition-partition-7-poll’ to member ‘1’, but member is not known. Known members are ‘[Member{id=zeebe-broker-gateway-b759d674d-p4nl4, address=10.233.91.226:26502, properties={event-service-topics-subscribed=Af8fAQEDAWpvYnNBdmFpbGFibOU=}}, Member{id=0, address=zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26502, properties={}}]’.
2021-01-26 04:34:10.882 [] [raft-server-0-raft-partition-partition-7] WARN io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-7}{role=FOLLOWER} - Poll request to 2 failed: java.net.ConnectException: Expected to send a message with subject ‘raft-partition-partition-7-poll’ to member ‘2’, but member is not known. Known members are ‘[Member{id=zeebe-broker-gateway-b759d674d-p4nl4, address=10.233.91.226:26502, properties={event-service-topics-subscribed=Af8fAQEDAWpvYnNBdmFpbGFibOU=}}, Member{id=0, address=zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26502, properties={}}]’.
2021-01-26 04:34:11.298 [] [raft-server-0-raft-partition-partition-6] WARN io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-6}{role=FOLLOWER} - Poll request to 1 failed: java.net.ConnectException: Expected to send a message with subject ‘raft-partition-partition-6-poll’ to member ‘1’, but member is not known. Known members are ‘[Member{id=zeebe-broker-gateway-b759d674d-p4nl4, address=10.233.91.226:26502, properties={event-service-topics-subscribed=Af8fAQEDAWpvYnNBdmFpbGFibOU=}}, Member{id=0, address=zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26502, properties={}}]’.
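
To see at a glance which members each partition fails to reach, warnings like these can be summarized with a small script. This is only a sketch: the patterns are derived from the lines quoted above, the function name `unreachable_members` is made up for illustration, and the sample line is a shortened copy of the first warning.

```python
import re
from collections import defaultdict

# Patterns are based only on the warning lines quoted in this thread.
PARTITION = re.compile(r"RaftServer\{raft-partition-partition-(\d+)\}")
TARGET = re.compile(r"Poll request to (\d+) failed")
KNOWN = re.compile(r"Member\{id=([^,}]+)")


def unreachable_members(lines):
    """Map partition id -> set of member ids that were polled but not known."""
    result = defaultdict(set)
    for line in lines:
        partition = PARTITION.search(line)
        target = TARGET.search(line)
        if not partition or not target or "member is not known" not in line:
            continue
        known = set(KNOWN.findall(line))
        if target.group(1) not in known:
            result[int(partition.group(1))].add(int(target.group(1)))
    return dict(result)


# Shortened copy of the first warning above, used as sample input.
sample = (
    "WARN io.atomix.raft.roles.FollowerRole - "
    "RaftServer{raft-partition-partition-7}{role=FOLLOWER} - "
    "Poll request to 1 failed: java.net.ConnectException: "
    "but member is not known. Known members are "
    "[Member{id=zeebe-broker-gateway-b759d674d-p4nl4, address=10.233.91.226:26502}, "
    "Member{id=0, address=zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26502}]"
)

print(unreachable_members([sample]))
```

A partition that keeps polling member ids outside the current cluster is a hint that its stored state does not match the current settings.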

Some more information from the StatefulSet, if it helps:
replicas: 3

  • name: JAVA_TOOL_OPTIONS
    value: -Xms4g -Xmx6g -XX:MaxRAMPercentage=25.0 -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/usr/local/zeebe/data -XX:ErrorFile=/usr/local/zeebe/data/zeebe_error%p.log
    -XX:+ExitOnOutOfMemoryError

Is there anything I am missing here to help increase the partitions?

Thanks

Hi @kpsandhya and welcome to our forums!

The first thing I noticed is that you’ve set ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR to 2. We generally recommend an odd value here. Perhaps you can try again with 1 or 3. There’s an ongoing discussion about whether we want to allow even values for this option, but in any case we see some users with stability issues with an even replication factor.

I’d be curious to hear whether this helps.
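
As background on why an odd value is usually suggested: Raft needs a majority of replicas to make progress, so an even replication factor adds a replica without adding fault tolerance. A minimal sketch of that arithmetic (the function names are illustrative and not part of Zeebe):

```python
def raft_quorum(replication_factor: int) -> int:
    """Smallest majority of a Raft group with the given number of replicas."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replication_factor - raft_quorum(replication_factor)


for rf in (1, 2, 3, 4, 5):
    print(
        f"replicationFactor={rf}: quorum={raft_quorum(rf)}, "
        f"tolerated failures={tolerated_failures(rf)}"
    )
```

With a replication factor of 2 the quorum is also 2, so losing either replica stalls the partition, which is no better than a factor of 1 in terms of availability.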

Hi @korthout,
Thank you for looking into this and for the suggestion. I tried with these parameters:
partitionsCount: 4, replicationFactor: 1, cpuThreadCount: 4, clusterSize: 3
But the broker stops at exactly the same point in the logs and starts logging warnings from raft-server-partitions.

“cluster” : {
“initialContactPoints” : [ “zeebe-broker-0.zeebe-broker.default.svc.cluster.local:26502”, “zeebe-broker-1.zeebe-broker.default.svc.cluster.local:26502”, “zeebe-broker-2.zeebe-broker.default.svc.cluster.local:26502” ],
“partitionIds” : [ 1, 2, 3, 4 ],
“nodeId” : 2,
“partitionsCount” : 4,
“replicationFactor” : 1,
“clusterSize” : 3,
“clusterName” : “zeebe-broker”,
“membership” : {
“broadcastUpdates” : false,
“broadcastDisputes” : true,
“notifySuspect” : false,
“gossipInterval” : “PT0.25S”,
“gossipFanout” : 2,
“probeInterval” : “PT1S”,
“probeTimeout” : “PT2S”,
“suspectProbes” : 3,
“failureTimeout” : “PT10S”,
“syncInterval” : “PT10S”
}
},
“threads” : {
“cpuThreadCount” : 4,
“ioThreadCount” : 2
},

2021-01-26 09:02:48.355 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-2 [5/13]: subscription api started in 78 ms
2021-01-26 09:02:48.356 [] [main] INFO io.zeebe.broker.system - Bootstrap Broker-2 [6/13]: cluster services
2021-01-26 09:02:48.977 [] [http-nio-0.0.0.0-9600-exec-2] INFO org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/] - Initializing Spring DispatcherServlet ‘dispatcherServlet’
2021-01-26 09:02:48.978 [] [http-nio-0.0.0.0-9600-exec-2] INFO org.springframework.web.servlet.DispatcherServlet - Initializing Servlet ‘dispatcherServlet’
2021-01-26 09:02:48.986 [] [http-nio-0.0.0.0-9600-exec-2] INFO org.springframework.web.servlet.DispatcherServlet - Completed initialization in 8 ms
2021-01-26 09:02:51.781 [] [raft-server-2-raft-partition-partition-3] WARN io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-3}{role=FOLLOWER} - Poll request to 3 failed: java.net.ConnectException: Expected to send a message with subject ‘raft-partition-partition-3-poll’ to member ‘3’, but member is not known. Known members are ‘[Member{id=1, address=zeebe-broker-1.zeebe-broker.default.svc.cluster.local:26502, properties={}}, Member{id=2, address=zeebe-broker-2.zeebe-broker.default.svc.cluster.local:26502, properties={}}]’.
2021-01-26 09:02:51.781 [] [raft-server-2-raft-partition-partition-3] WARN io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-3}{role=FOLLOWER} - Poll request to 0 failed: java.net.ConnectException: Expected to send a message with subject ‘raft-partition-partition-3-poll’ to member ‘0’, but member is not known. Known members are ‘[Member{id=1, address=zeebe-broker-1.zeebe-broker.default.svc.cluster.local:26502, properties={}}, Member{id=2, address=zeebe-broker-2.zeebe-broker.default.svc.cluster.local:26502, properties={}}]’.
2021-01-26 09:02:51.784 [] [raft-server-2-raft-partition-partition-3] WARN io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-3}{role=FOLLOWER} - Poll request to 1 failed: io.atomix.cluster.messaging.MessagingException$NoRemoteHandler: No remote message handler registered for this message

Thanks

After discussing with @deepthi we found what’s wrong. Zeebe does not support changing the number of partitions or the cluster size of an existing cluster. To fix it, stop your cluster, clean out your data folder and restart the cluster.
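
One way to catch this before the brokers start is to record the settings a data folder was created with and compare them on every start. The sketch below is an assumption-heavy illustration, not something Zeebe does itself: the marker file `cluster-settings.json` and the function `check_settings` are hypothetical, and the replication factor is included because it is named later in this thread as another setting that cannot be changed.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical marker file; Zeebe itself does not create or read it.
MARKER_NAME = "cluster-settings.json"
KEYS = (
    "ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT",
    "ZEEBE_BROKER_CLUSTER_CLUSTERSIZE",
    "ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR",
)


def check_settings(data_dir, env):
    """Return the settings that differ from the ones recorded earlier.

    On first use (no marker yet) the current settings are recorded and an
    empty list is returned.
    """
    current = {key: env.get(key) for key in KEYS}
    marker = Path(data_dir) / MARKER_NAME
    if not marker.exists():
        marker.write_text(json.dumps(current))
        return []
    recorded = json.loads(marker.read_text())
    return [key for key in KEYS if recorded.get(key) != current[key]]


# Usage with a temporary directory standing in for the broker data folder.
with tempfile.TemporaryDirectory() as data_dir:
    original = {
        "ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT": "3",
        "ZEEBE_BROKER_CLUSTER_CLUSTERSIZE": "3",
        "ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR": "2",
    }
    print(check_settings(data_dir, original))  # first start: nothing to compare
    changed = dict(original, ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT="7")
    print(check_settings(data_dir, changed))   # reports the changed setting
```

Because the marker lives inside the data folder, cleaning out the folder also removes it, so a fresh cluster records its new settings on the next start.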

Hi @korthout
Thanks a lot for this solution. It worked after cleaning up the data volume. We could do this because it was in the development environment. Is there any workaround or set of steps to retain the data if we need to do the same in, say, a production environment?

Glad to hear this solved it.

At this time Zeebe does not support changing the partitionsCount, clusterSize, or replicationFactor of an existing cluster. You should experiment to find which settings work for your use case and then move to production. We’d be curious whether such a feature is important to you. If so, please share it in this issue.
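
When experimenting with these three settings, it can help to see how the partitions would be laid out across the brokers. The following is a rough sketch that assumes a simple round-robin scheme in which partition p starts at broker (p - 1) modulo the cluster size; the exact assignment in a given Zeebe version may differ, and `distribute_partitions` is an illustrative name rather than a Zeebe API.

```python
from collections import Counter


def distribute_partitions(partitions_count, cluster_size, replication_factor):
    """Assumed round-robin layout: partition id -> list of broker node ids."""
    if replication_factor > cluster_size:
        raise ValueError("replication factor cannot exceed cluster size")
    layout = {}
    for partition_id in range(1, partitions_count + 1):
        first = (partition_id - 1) % cluster_size
        layout[partition_id] = [
            (first + offset) % cluster_size for offset in range(replication_factor)
        ]
    return layout


# The settings from the original question: 7 partitions, 3 brokers, 2 replicas.
layout = distribute_partitions(7, 3, 2)
for partition_id, brokers in layout.items():
    print(f"partition {partition_id}: brokers {brokers}")

# Number of partition replicas each broker would host under this assumption.
load = Counter(broker for brokers in layout.values() for broker in brokers)
print(dict(sorted(load.items())))
```

Under this assumption the 14 replicas are spread as 5, 5, and 4 across the three brokers, which gives a rough idea of how many partitions each broker's CPU threads would have to serve.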