Handling cluster restarts

I had a working cluster of 3 brokers plus a standalone gateway, spread across three VMs. Two of the VMs ran out of memory and crashed. I restarted all the services; below are my observations.

  1. The workflows were lost and I had to redeploy them.
  2. Operate shows stale data: instances are being processed, but they do not show up in Operate.

Error log from operate:

zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | 2020-10-14 09:48:20.067 ERROR 6 --- [ archiver_1] o.c.o.a.AbstractArchiverJob : Error occurred while archiving data. Will be retried.
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 |
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | org.camunda.operate.exceptions.OperateRuntimeException: Exception occurred, while reindexing the documents: 30,000 milliseconds timeout on connection http-outgoing-5955 [ACTIVE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.reindexDocuments(Archiver.java:172) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.moveDocuments(Archiver.java:106) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.WorkflowInstancesArchiverJob.archiveBatch(WorkflowInstancesArchiverJob.java:149) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.AbstractArchiverJob.archiveNextBatch(AbstractArchiverJob.java:70) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.AbstractArchiverJob.run(AbstractArchiverJob.java:51) [camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.2.7.RELEASE.jar!/:5.2.7.RELEASE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at java.lang.Thread.run(Unknown Source) [?:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5955 [ACTIVE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:944) ~[elasticsearch-rest-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestClient.performRequest(RestClient.java:233) ~[elasticsearch-rest-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1764) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1734) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1696) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.elasticsearch.client.RestHighLevelClient.reindex(RestHighLevelClient.java:516) ~[elasticsearch-rest-high-level-client-6.8.10.jar!/:6.8.10]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.lambda$reindexDocuments$1(Archiver.java:165) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at io.micrometer.core.instrument.AbstractTimer.recordCallable(AbstractTimer.java:138) ~[micrometer-core-1.5.2.jar!/:1.5.2]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.reindexWithTimer(Archiver.java:119) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.camunda.operate.archiver.Archiver.reindexDocuments(Archiver.java:165) ~[camunda-operate-archiver-0.24.2.jar!/:?]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | … 11 more
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5955 [ACTIVE]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 | … 1 more
zeebe_operate.1.eyygdp337cht@ip-172-31-44-55 |

Error log from broker:

zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | 2020-10-14 09:50:34.362 [Broker-0-Exporter-1] [Broker-0-zb-fs-workers-0] ERROR io.zeebe.broker.exporter.elasticsearch - Unexpected exception occurred on periodically flushing bulk, will retry later.
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | io.zeebe.exporter.ElasticsearchExporterException: Failed to flush all items of the bulk
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.exporter.ElasticsearchClient.flush(ElasticsearchClient.java:153) ~[zeebe-elasticsearch-exporter-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.exporter.ElasticsearchExporter.flush(ElasticsearchExporter.java:116) ~[zeebe-elasticsearch-exporter-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.exporter.ElasticsearchExporter.flushAndReschedule(ElasticsearchExporter.java:103) ~[zeebe-elasticsearch-exporter-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:76) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:118) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.24.2.jar:0.24.2]
zeebe_node0.1.uta4t9n1pknf@ip-172-31-32-27 | at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.24.2.jar:0.24.2]

Elasticsearch is external and was not affected. Can someone help me understand what actually happened here and how to handle cluster failures?
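
For what it is worth, this is roughly how I checked that the external Elasticsearch itself was still healthy (the host is a placeholder for my actual Elasticsearch address):

# cluster health should come back green or yellow, not red
curl -s 'http://<elasticsearch-host>:9200/_cluster/health?pretty'

# confirm all data nodes are present
curl -s 'http://<elasticsearch-host>:9200/_cat/nodes?v'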

Please share the configuration files.

version: "3"

networks:
  zeebe_swarm:
    driver: overlay
  agent_network:
    driver: overlay
    # attachable: true
  
volumes:
  zeebe_elasticsearch_data:
  portainer_data:

services:
  gateway:
    image: camunda/zeebe:0.24.2
    environment:
      - ZEEBE_LOG_LEVEL=debug
      - ZEEBE_STANDALONE_GATEWAY=true
      # host and port where gateway connects to broker for first contact
      - ZEEBE_GATEWAY_CLUSTER_CONTACTPOINT=node0:26502
      # host and port where clients connect
      - ZEEBE_GATEWAY_NETWORK_PORT=26500
      - ZEEBE_GATEWAY_MONITORING_PORT=9600
    ports:
      - "26500:26500"
      - "9600:9600"
    networks:
      - zeebe_swarm
    deploy:
      resources:
        limits:
          memory: 1024M

  node0:
    image: camunda/zeebe:0.24.2
    environment:
      - ZEEBE_LOG_LEVEL=debug
      - ZEEBE_STANDALONE_GATEWAY=false
      - ZEEBE_BROKER_GATEWAY_ENABLE=false
      - ZEEBE_NODE_ID=0
      - ZEEBE_PARTITIONS_COUNT=2
      - ZEEBE_REPLICATION_FACTOR=3
      - ZEEBE_CLUSTER_SIZE=3
      - ZEEBE_CONTACT_POINTS=node0:26502,node1:26502,node2:26502
      - ZEEBE_BROKER_NETWORK_MONITORINGAPI_PORT=9701
    ports:
      - "9701:9701"
    volumes:
      - ./application.yaml:/usr/local/zeebe/config/application.yaml  
    networks:
      - zeebe_swarm
    deploy:
      resources:
        limits:
          memory: 1024M
  
  node1:
    image: camunda/zeebe:0.24.2
    environment:
      - ZEEBE_LOG_LEVEL=debug
      - ZEEBE_STANDALONE_GATEWAY=false
      - ZEEBE_BROKER_GATEWAY_ENABLE=false
      - ZEEBE_NODE_ID=1
      - ZEEBE_PARTITIONS_COUNT=2
      - ZEEBE_REPLICATION_FACTOR=3
      - ZEEBE_CLUSTER_SIZE=3
      - ZEEBE_CONTACT_POINTS=node0:26502,node1:26502,node2:26502
      - ZEEBE_BROKER_NETWORK_MONITORINGAPI_PORT=9702
    ports:
      - "9702:9702"
    volumes:
      - ./application.yaml:/usr/local/zeebe/config/application.yaml  
    networks:
      - zeebe_swarm
    depends_on:
      - node0
    deploy:
      resources:
        limits:
          memory: 1024M
  
  node2:
    image: camunda/zeebe:0.24.2
    environment:
      - ZEEBE_LOG_LEVEL=debug
      - ZEEBE_STANDALONE_GATEWAY=false
      - ZEEBE_BROKER_GATEWAY_ENABLE=false
      - ZEEBE_NODE_ID=2
      - ZEEBE_PARTITIONS_COUNT=2
      - ZEEBE_REPLICATION_FACTOR=3
      - ZEEBE_CLUSTER_SIZE=3
      - ZEEBE_CONTACT_POINTS=node0:26502,node1:26502,node2:26502
      - ZEEBE_BROKER_NETWORK_MONITORINGAPI_PORT=9703
    ports:
      - "9703:9703"
    volumes:
      - ./application.yaml:/usr/local/zeebe/config/application.yaml  
    networks:
      - zeebe_swarm
    depends_on:
      - node1
    deploy:
      resources:
        limits:
          memory: 1024M
  
  operate:
    image: camunda/operate:0.24.2
    ports:
      - "8080:8080"
    volumes:
      - ./operate_application.yaml:/usr/local/operate/config/application.yml
    depends_on:
      - node2
    networks:
      - zeebe_swarm
    deploy:
      resources:
        limits:
          memory: 1024M
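
One thing I noticed while putting this together: the brokers only mount application.yaml, and there is no volume for the broker data directory, so broker state lives inside each container's filesystem and disappears when a container is recreated. If that is part of the problem, I believe persisting it would look roughly like this (a sketch only; /usr/local/zeebe/data is the data directory inside the image, and the volume name is just a placeholder):

  node0:
    # ...existing node0 settings...
    volumes:
      - ./application.yaml:/usr/local/zeebe/config/application.yaml
      - zeebe_node0_data:/usr/local/zeebe/data

volumes:
  zeebe_node0_data: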

Can anyone help with this?

I was able to figure out the reason for this one. There was stale data in Elasticsearch from previous deployments, because of which the brokers were not able to flush their exported records to Elasticsearch. I deleted all the stale indices and restarted the cluster, and everything seems to work fine now.
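
In case anyone else runs into this, the cleanup was roughly the following (a sketch, assuming the default index prefixes zeebe-record for the Zeebe exporter and operate for Operate; note that deleting the operate* indices also wipes Operate's history, and the stack name and compose file name below are just my setup):

# list the leftover Zeebe / Operate indices
curl -s 'http://<elasticsearch-host>:9200/_cat/indices/zeebe-record*,operate*?v'

# with the brokers and Operate stopped, delete the stale indices
# (wildcard deletes are rejected if action.destructive_requires_name is enabled)
curl -s -X DELETE 'http://<elasticsearch-host>:9200/zeebe-record*'
curl -s -X DELETE 'http://<elasticsearch-host>:9200/operate*'

# redeploy the stack; the exporter and Operate recreate their indices on startup
docker stack deploy -c docker-compose.yml zeebe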
