I had a working cluster of three brokers and a standalone gateway spread across three VMs. Two of the VMs ran out of memory and crashed. I restarted all the services, and below are my observations.
- The deployed workflows were lost and I had to redeploy them.
- Operate shows stale data. Instances are being processed, but they do not show up in Operate.
Error log from Operate:
```
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | 2020-10-14 09:48:20.067 ERROR 6 — [ archiver_1] o.c.o.a.AbstractArchiverJob : Error occurred while archiving data. Will be retried.
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 |
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | org.camunda.operate.exceptions.OperateRuntimeException: Exception occurred, while reindexing the documents: 30,000 milliseconds timeout on connection http-outgoing-5955 [ACTIVE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.reindexDocuments(Archiver.java:172) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.moveDocuments(Archiver.java:106) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.WorkflowInstancesArchiverJob.archiveBatch(WorkflowInstancesArchiverJob.java:149) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.AbstractArchiverJob.archiveNextBatch(AbstractArchiverJob.java:70) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.AbstractArchiverJob.run(AbstractArchiverJob.java:51) [camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.2.7.RELEASE.jar!/:5.2.7.RELEASE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.lang.Thread.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5955 [ACTIVE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:944) ~[elasticsearch-rest-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestClient.performRequest(RestClient.java:233) ~[elasticsearch-rest-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1764) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1734) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1696) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.reindex(RestHighLevelClient.java:516) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.lambda$reindexDocuments$1(Archiver.java:165) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at io.micrometer.core.instrument.AbstractTimer.recordCallable(AbstractTimer.java:138) ~[micrometer-core-1.5.2.jar!/:1.5.2]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.reindexWithTimer(Archiver.java:119) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.reindexDocuments(Archiver.java:165) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | … 11 more
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5955 [ACTIVE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | … 1 more
```
Error log from one of the brokers:
```
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | 2020-10-14 09:50:34.362 [Broker-0-Exporter-1] [Broker-0-zb-fs-workers-0] ERROR io.zeebe.broker.exporter.elasticsearch - Unexpected exception occurred on periodically flushing bulk, will retry later.
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | io.zeebe.exporter.ElasticsearchExporterException: Failed to flush all items of the bulk
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.exporter.ElasticsearchClient.flush(ElasticsearchClient.java:153) ~[zeebe-elasticsearch-exporter-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.exporter.ElasticsearchExporter.flush(ElasticsearchExporter.java:116) ~[zeebe-elasticsearch-exporter-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.exporter.ElasticsearchExporter.flushAndReschedule(ElasticsearchExporter.java:103) ~[zeebe-elasticsearch-exporter-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:76) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:118) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.24.2.jar:0.24.2]
```
Elasticsearch runs externally and was not affected by the crash. Can someone help me understand what actually happened here, and what the recommended way is to handle such cluster failures?
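For reference, this is roughly how I checked that the external Elasticsearch itself is reachable and healthy (`es-host:9200` is a placeholder for my actual address, and the index patterns are what the Zeebe exporter and Operate create by default):

```shell
# Overall cluster health (status should be green/yellow, not red).
curl -s "http://es-host:9200/_cluster/health?pretty"

# Health, doc count, and size of the Zeebe/Operate indices, since the
# archiver timeout suggests the reindex calls against them are slow.
curl -s "http://es-host:9200/_cat/indices/zeebe-*,operate-*?v&h=index,health,docs.count,store.size"
```

Both checks come back fine for me, which is why the Operate archiver timeouts and the broker exporter flush failures are confusing.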