Hi
I have the following Zeebe cluster running in our development environment:
Cluster size: 1
Partition count: 1
Replication Factor: 1
Zeebe version: 0.22.1
The cluster has both the Elasticsearch and Hazelcast exporters configured, with Simple Monitor attached to the Hazelcast ringbuffer.
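For reference, the exporter section of my zeebe.cfg.toml looks roughly like this. This is a sketch from memory rather than the exact file: the exporter IDs, the Elasticsearch URL, and the Hazelcast exporter class name are what I believe we have configured (zeebe-hazelcast-exporter 0.7.0 on Zeebe 0.22.x):

```toml
# Exporter configuration (Zeebe 0.22.x, zeebe.cfg.toml) -- approximate

[[exporters]]
id = "elasticsearch"
className = "io.zeebe.exporter.ElasticsearchExporter"

  [exporters.args]
  url = "http://elasticsearch:9200"  # assumed service URL in our namespace

[[exporters]]
id = "hazelcast"
className = "io.zeebe.hazelcast.exporter.HazelcastExporter"
```

Simple Monitor then connects to the embedded Hazelcast instance on port 5701 to read the exported records.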
The cluster had been running fine for about 30 days, but this morning the Zeebe broker pod started failing. The logs report that the Hazelcast exporter cannot recover its partitions, followed by a series of exceptions. Although the data isn't critical, the pod still hasn't started after more than 5 hours.
Any ideas what is causing these errors?
Thanks for your help,
Andy
2020-06-01 11:48:41.585 [] [hz._hzInstance_1_dev.cached.thread-7] INFO com.hazelcast.internal.cluster.impl.MembershipManager - [10.244.2.75]:5701 [dev] [3.12.5] Mastership is claimed with: MembersView{version=19, members=[MemberInfo{address=[10.244.2.75]:5701, uuid=c6572eb2-dde5-44fb-aaa7-6703c9df98c5, liteMember=false, memberListJoinVersion=18}]}
2020-06-01 11:49:13.389 [] [hz._hzInstance_1_dev.cached.thread-6] INFO com.hazelcast.transaction.TransactionManagerService - [10.244.2.75]:5701 [dev] [3.12.5] Committing/rolling-back live transactions of [10.244.2.64]:5701, UUID: 2e1080d9-d8b6-4a88-a04d-9c348eb58d1d
2020-06-01 11:49:33.807 [] [hz._hzInstance_1_dev.migration] INFO com.hazelcast.internal.partition.InternalPartitionService - [10.244.2.75]:5701 [dev] [3.12.5] Fetching most recent partition table! my version: 7449
2020-06-01 11:49:47.964 [] [hz._hzInstance_1_dev.migration] INFO com.hazelcast.internal.partition.InternalPartitionService - [10.244.2.75]:5701 [dev] [3.12.5] Most recent partition table version: 7449
2020-06-01 11:50:53.121 [] [hz._hzInstance_1_dev.migration] WARN com.hazelcast.internal.partition.impl.MigrationManager - [10.244.2.75]:5701 [dev] [3.12.5] partitionId=0 is completely lost!
2020-06-01 11:51:19.640 [] [hz._hzInstance_1_dev.migration] WARN com.hazelcast.internal.partition.impl.MigrationManager - [10.244.2.75]:5701 [dev] [3.12.5] partitionId=1 is completely lost!
2020-06-01 11:51:23.058 [] [hz._hzInstance_1_dev.migration] WARN com.hazelcast.internal.partition.impl.MigrationManager - [10.244.2.75]:5701 [dev] [3.12.5] partitionId=2 is completely lost!
....
2020-06-01 12:06:38.803 [] [hz._hzInstance_1_dev.migration] WARN com.hazelcast.internal.partition.impl.MigrationManager - [10.244.2.75]:5701 [dev] [3.12.5] partitionId=270 is completely lost!
2020-06-01 12:06:56.167 [] [hz._hzInstance_1_dev.IO.thread-in-2] INFO com.hazelcast.nio.tcp.TcpIpConnection - [10.244.2.75]:5701 [dev] [3.12.5] Initialized new cluster connection between /10.244.2.75:36609 and /10.244.2.76:5701
2020-06-01 12:07:02.672 [] [hz._hzInstance_1_dev.cached.thread-9] INFO com.hazelcast.internal.partition.InternalPartitionService - [10.244.2.75]:5701 [dev] [3.12.5] Remaining migration tasks in queue => 1. (lastRepartitionTime=Thu Jan 01 00:00:00 UTC 1970, completedMigrations=271, totalCompletedMigrations=271, elapsedMigrationTime=0ms, totalElapsedMigrationTime=0ms)
2020-06-01
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> [10.244.2.76]:5701 over Connection[id=40, /10.244.2.75:34599->/10.244.2.76:5701, qualifier=null, endpoint=[10.244.2.76]:5701, alive=false, type=MEMBER]
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:92) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:179) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.AbstractJoiner.sendSplitBrainJoinMessage(AbstractJoiner.java:264) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.AbstractJoiner.sendSplitBrainJoinMessageAndCheckResponse(AbstractJoiner.java:229) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.MulticastJoiner.searchForOtherClusters(MulticastJoiner.java:129) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.SplitBrainHandler.searchForOtherClusters(SplitBrainHandler.java:75) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.SplitBrainHandler.run(SplitBrainHandler.java:42) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.executionservice.impl.DelegateAndSkipOnConcurrentExecutionDecorator$DelegateDecorator.run(DelegateAndSkipOnConcurrentExecutionDecorator.java:77) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [zeebe-hazelcast-exporter.jar:0.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [zeebe-hazelcast-exporter.jar:0.7.0]
Caused by: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> [10.244.2.76]:5701 over Connection[id=40, /10.244.2.75:34599->/10.244.2.76:5701, qualifier=null, endpoint=[10.244.2.76]:5701, alive=false, type=MEMBER]
at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvokeRemote(Invocation.java:657) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvoke(Invocation.java:631) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke0(Invocation.java:592) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke(Invocation.java:256) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.operationservice.impl.InvocationBuilderImpl.invoke(InvocationBuilderImpl.java:61) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.AbstractJoiner.sendSplitBrainJoinMessage(AbstractJoiner.java:262) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.AbstractJoiner.sendSplitBrainJoinMessageAndCheckResponse(AbstractJoiner.java:229) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.MulticastJoiner.searchForOtherClusters(MulticastJoiner.java:129) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.SplitBrainHandler.searchForOtherClusters(SplitBrainHandler.java:75) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.cluster.impl.SplitBrainHandler.run(SplitBrainHandler.java:42) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.executionservice.impl.DelegateAndSkipOnConcurrentExecutionDecorator$DelegateDecorator.run(DelegateAndSkipOnConcurrentExecutionDecorator.java:77) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at ------ submitted from ------.(Unknown Source) ~[?:?]
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:126) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79) ~[zeebe-hazelcast-exporter.jar:0.7.0]
... 13 more
2020-06-01 14:48:06.204 [] [hz._hzInstance_1_dev.cached.thread-15] INFO com.hazelcast.internal.cluster.impl.MembershipManager - [10.244.2.75]:5701 [dev] [3.12.5] Removing Member [10.244.2.76]:5701 - 5181e962-c86e-40db-a8bc-d65ad274437b
2020-06-01 14:48:06.355 [] [hz._hzInstance_1_dev.cached.thread-11] INFO com.hazelcast.internal.partition.InternalPartitionService - [10.244.2.75]:5701 [dev] [3.12.5] Remaining migration tasks in queue => 2. (lastRepartitionTime=Mon Jun 01 12:51:55 UTC 2020, completedMigrations=23, totalCompletedMigrations=294, elapsedMigrationTi
2020-06-01 15:04:24.130 [] [hz._hzInstance_1_dev.priority-generic-operation.thread-0] INFO com.hazelcast.internal.cluster.impl.ClusterJoinManager - [10.244.2.75]:5701 [dev] [3.12.5] [10.244.2.76]:5701 should merge to us , both have the same data member count: 1
2020-06-01 15:04:29.527 [] [hz._hzInstance_1_dev.generic-operation.thread-0] INFO com.hazelcast.internal.cluster.impl.ClusterJoinManager - [10.244.2.75]:5701 [dev] [3.12.5] [10.244.2.76]:5701 should merge to us , both have the same data member count: 1
2020-06-01 15:04:30.841 [] [hz._hzInstance_1_dev.cached.thread-9] INFO com.hazelcast.internal.cluster.impl.ClusterHeartbeatManager - [10.244.2.75]:5701 [dev] [3.12.5] System clock apparently jumped from 2020-06-01 15:00:45.301 to 2020-06-01 15:03:51.673 since last heartbeat (+181372 ms)
2020-06-01 15:04:20.607 [] [hz._hzInstance_1_dev.cached.thread-13] INFO com.hazelcast.internal.partition.InternalPartitionService - [10.244.2.75]:5701 [dev] [3.12.5] Remaining migration tasks in queue => 2. (lastRepartitionTime=Mon Jun 01 12:51:55 UTC 2020, completedMigrations=23, totalCompletedMigrations=294, elapsedMigrationTime=1865294ms, totalElapsedMigrationTime=1865294ms)
2020-06-01 15:04:36.022 [] [hz._hzInstance_1_dev.cached.thread-9] WARN com.hazelcast.internal.cluster.impl.ClusterHeartbeatManager - [10.244.2.75]:5701 [dev] [3.12.5] Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 181372 ms, Heartbeat-Timeout: 60000 ms
2020-06-01 15:03:38.565 [] [hz._hzInstance_1_dev.migration] ERROR com.hazelcast.internal.partition.impl.MigrationManager - [10.244.2.75]:5701 [dev] [3.12.5] Promotion commit to [10.244.2.75]:5701 - c6572eb2-dde5-44fb-aaa7-6703c9df98c5 failed for 22 migrations
java.util.concurrent.ExecutionException: PromotionCommitOperation got rejected before execution due to not starting within the operation-call-timeout of: 60000 ms. Current time: 2020-06-01 15:01:43.338. Start time: 2020-06-01 14:52:39.675. Total elapsed time: 575442 ms. Invocation{op=com.hazelcast.internal.partition.operation.PromotionCommitOperation{serviceName='hz:core:partitionService', identityHash=2052331847, partitionId=-1, replicaIndex=0, callId=-708, invocationTime=1591020079378 (2020-06-01 14:01:19.378), waitTimeout=-1, callTimeout=60000}, tryCount=2147483647, tryPauseMillis=500, invokeCount=2, callTimeoutMillis=60000, firstInvocationTimeMs=1591023159675, firstInvocationTime='2020-06-01 14:52:39.675', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.244.2.75]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=1, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:163) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:104) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:163) ~[zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationManager$RepairPartitionTableTask.commitPromotionsToDestination(MigrationManager.java:1476) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationManager$RepairPartitionTableTask.commitPromotionsToDestination(MigrationManager.java:1486) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationManager$RepairPartitionTableTask.commitPromotionsToDestination(MigrationManager.java:1486) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationManager$RepairPartitionTableTask.commitPromotionsToDestination(MigrationManager.java:1486) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationManager$RepairPartitionTableTask.commitPromotionsToDestination(MigrationManager.java:1486) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationManager$RepairPartitionTableTask.commitPromotionMigrations(MigrationManager.java:1360) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationManager$RepairPartitionTableTask.promoteBackupsForMissingOwners(MigrationManager.java:1346) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationManager$RepairPartitionTableTask.run(MigrationManager.java:1286) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationThread.processTask(MigrationThread.java:122) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationThread.doRun(MigrationThread.java:98) [zeebe-hazelcast-exporter.jar:0.7.0]
at com.hazelcast.internal.partition.impl.MigrationThread.run(MigrationThread.java:67) [zeebe-hazelcast-exporter.jar:0.7.0]
Caused by: com.hazelcast.core.OperationTimeoutException: PromotionCommitOperation got rejected before execution due to not starting within the operation-call-timeout of: 60000 ms. Current time: 2020-06-01 15:01:43.338. Start time: 2020-06-01 14:52:39.675. Total elapsed time: 575442 ms. Invocation{op=com.hazelcast.internal.partition.operation.PromotionCommitOperation{serviceName='hz:core:partitionService', identityHash=2052331847, partitionId=-1, replicaIndex=0, callId=-708, invocationTime=1591020079378 (2020-06-01 14:01:19.378), waitTimeout=-1, callTimeout=60000}, tryCount=2147483647, tryPauseMillis=500, invokeCount=2, callTimeoutMillis=60000, firstInvocationTimeMs=1591023159675, firstInvocationTime='2020-06-01 14:52:39.675', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.244.2.75]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=1, connection=null}