Handling cluster restarts

I had a working cluster of 3 brokers and a standalone gateway spread across three VMs. Two of the VMs ran out of memory and crashed. I restarted all the services, and below are my observations.

  1. The deployed workflows were lost and I had to redeploy them (see the volume sketch after this list).
  2. Operate shows stale data: instances are being processed, but they do not show up in Operate.
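
My guess (unconfirmed) is that the workflows were lost because the brokers keep their state inside the container and nothing is mounted at Zeebe's data directory, so a container that is recreated after a crash starts empty. A minimal sketch of what I am considering adding for each broker, assuming the camunda/zeebe image keeps its data under /usr/local/zeebe/data:

volumes:
  zeebe_node0_data:

services:
  node0:
    volumes:
      # named volume so partition data survives container recreation (sketch only)
      - zeebe_node0_data:/usr/local/zeebe/data
      - ./application.yaml:/usr/local/zeebe/config/application.yaml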

Error log from Operate:

zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | 2020-10-14 09:48:20.067 ERROR 6 — [ archiver_1] o.c.o.a.AbstractArchiverJob : Error occurred while archiving data. Will be retried.
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 |
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | org.camunda.operate.exceptions.OperateRuntimeException: Exception occurred, while reindexing the documents: 30,000 milliseconds timeout on connection http-outgoing-5955 [ACTIVE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.reindexDocuments(Archiver.java:172) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.moveDocuments(Archiver.java:106) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.WorkflowInstancesArchiverJob.archiveBatch(WorkflowInstancesArchiverJob.java:149) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.AbstractArchiverJob.archiveNextBatch(AbstractArchiverJob.java:70) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.AbstractArchiverJob.run(AbstractArchiverJob.java:51) [camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.2.7.RELEASE.jar!/:5.2.7.RELEASE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.lang.Thread.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5955 [ACTIVE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:944) ~[elasticsearch-rest-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestClient.performRequest(RestClient.java:233) ~[elasticsearch-rest-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1764) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1734) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1696) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.reindex(RestHighLevelClient.java:516) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.lambda$reindexDocuments$1(Archiver.java:165) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at io.micrometer.core.instrument.AbstractTimer.recordCallable(AbstractTimer.java:138) ~[micrometer-core-1.5.2.jar!/:1.5.2]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.reindexWithTimer(Archiver.java:119) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.reindexDocuments(Archiver.java:165) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | … 11 more
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5955 [ACTIVE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | … 1 more
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 |

Error log from broker:

zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | 2020-10-14 09:50:34.362 [Broker-0-Exporter-1] [Broker-0-zb-fs-workers-0] ERROR io.zeebe.broker.exporter.elasticsearch - Unexpected exception occurred on periodically flushing bulk, will retry later.
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | io.zeebe.exporter.ElasticsearchExporterException: Failed to flush all items of the bulk
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.exporter.ElasticsearchClient.flush(ElasticsearchClient.java:153) ~[zeebe-elasticsearch-exporter-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.exporter.ElasticsearchExporter.flush(ElasticsearchExporter.java:116) ~[zeebe-elasticsearch-exporter-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.exporter.ElasticsearchExporter.flushAndReschedule(ElasticsearchExporter.java:103) ~[zeebe-elasticsearch-exporter-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:76) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:118) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.24.2.jar:0.24.2]

Elasticsearch is external and was not affected. Can someone help me understand what actually happened here and how to handle cluster failures?
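
The broker error above comes from the Elasticsearch exporter failing to flush its bulk. I have not overridden the exporter settings, so I assume the bulk configuration in the broker application.yaml is effectively the defaults, roughly like the sketch below (the URL is a placeholder; delay and size values are my understanding of the defaults). If Elasticsearch is slow to respond after a restart, these are presumably the knobs to tune:

zeebe:
  broker:
    exporters:
      elasticsearch:
        className: io.zeebe.exporter.ElasticsearchExporter
        args:
          # placeholder URL; my actual Elasticsearch address differs
          url: http://elasticsearch:9200
          bulk:
            delay: 5     # seconds between flush attempts (default, as I understand it)
            size: 1000   # records per bulk request (default, as I understand it)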

Sharing my configuration files below:

version: "3"

networks:
  zeebe_swarm:
    driver: overlay
  agent_network:
    driver: overlay
    # attachable: true
  
volumes:
  zeebe_elasticsearch_data:
  portainer_data:

services:
  gateway:
    image: camunda/zeebe:0.24.2
    environment:
      - ZEEBE_LOG_LEVEL=debug
      - ZEEBE_STANDALONE_GATEWAY=true
      # host and port where gateway connects to broker for first contact
      - ZEEBE_GATEWAY_CLUSTER_CONTACTPOINT=node0:26502
      # host and port where clients connect
      - ZEEBE_GATEWAY_NETWORK_PORT=26500
      - ZEEBE_GATEWAY_MONITORING_PORT=9600
    ports:
      - "26500:26500"
      - "9600:9600"
    networks:
      - zeebe_swarm
    deploy:
      resources:
        limits:
          memory: 1024M

  node0:
    image: camunda/zeebe:0.24.2
    environment:
      - ZEEBE_LOG_LEVEL=debug
      - ZEEBE_STANDALONE_GATEWAY=false
      - ZEEBE_BROKER_GATEWAY_ENABLE=false
      - ZEEBE_NODE_ID=0
      - ZEEBE_PARTITIONS_COUNT=2
      - ZEEBE_REPLICATION_FACTOR=3
      - ZEEBE_CLUSTER_SIZE=3
      - ZEEBE_CONTACT_POINTS=node0:26502,node1:26502,node2:26502
      - ZEEBE_BROKER_NETWORK_MONITORINGAPI_PORT=9701
    ports:
      - "9701:9701"
    volumes:
      - ./application.yaml:/usr/local/zeebe/config/application.yaml  
    networks:
      - zeebe_swarm
    deploy:
      resources:
        limits:
          memory: 1024M
  
  node1:
    image: camunda/zeebe:0.24.2
    environment:
      - ZEEBE_LOG_LEVEL=debug
      - ZEEBE_STANDALONE_GATEWAY=false
      - ZEEBE_BROKER_GATEWAY_ENABLE=false
      - ZEEBE_NODE_ID=1
      - ZEEBE_PARTITIONS_COUNT=2
      - ZEEBE_REPLICATION_FACTOR=3
      - ZEEBE_CLUSTER_SIZE=3
      - ZEEBE_CONTACT_POINTS=node0:26502,node1:26502,node2:26502
      - ZEEBE_BROKER_NETWORK_MONITORINGAPI_PORT=9702
    ports:
      - "9702:9702"
    volumes:
      - ./application.yaml:/usr/local/zeebe/config/application.yaml  
    networks:
      - zeebe_swarm
    depends_on:
      - node0
    deploy:
      resources:
        limits:
          memory: 1024M
  
  node2:
    image: camunda/zeebe:0.24.2
    environment:
      - ZEEBE_LOG_LEVEL=debug
      - ZEEBE_STANDALONE_GATEWAY=false
      - ZEEBE_BROKER_GATEWAY_ENABLE=false
      - ZEEBE_NODE_ID=2
      - ZEEBE_PARTITIONS_COUNT=2
      - ZEEBE_REPLICATION_FACTOR=3
      - ZEEBE_CLUSTER_SIZE=3
      - ZEEBE_CONTACT_POINTS=node0:26502,node1:26502,node2:26502
      - ZEEBE_BROKER_NETWORK_MONITORINGAPI_PORT=9703
    ports:
      - "9703:9703"
    volumes:
      - ./application.yaml:/usr/local/zeebe/config/application.yaml  
    networks:
      - zeebe_swarm
    depends_on:
      - node1
    deploy:
      resources:
        limits:
          memory: 1024M
  
  operate:
    image: camunda/operate:0.24.2
    ports:
      - "8080:8080"
    volumes:
      - ./operate_application.yaml:/usr/local/operate/config/application.yml
    depends_on:
      - node2
    networks:
      - zeebe_swarm
    deploy:
      resources:
        limits:
          memory: 1024M
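
One more thing I am unsure about: in swarm mode a crashed broker container can be rescheduled onto a different VM, in which case a local named volume would not follow it. A sketch of what I am considering per broker, where the zeebe=node0 label is hypothetical and would have to be added first with docker node update --label-add:

  node0:
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          # hypothetical node label pinning this broker to one VM
          - node.labels.zeebe == node0
      resources:
        limits:
          memory: 1024M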

Can anyone help with this?