Standalone gateway not optional in Zeebe-Cluster-Helm chart

Hi

I recently noticed that the zeebe-cluster-helm chart includes a standalone gateway deployment, where previous versions assumed the gateway to be embedded. https://github.com/zeebe-io/zeebe-cluster-helm/blob/master/zeebe-cluster/templates/gateway-deployment.yaml

We are deploying Zeebe into a Kubernetes cluster that uses Istio, and from what I can see the standalone gateway deployment doesn’t look optional. As I understood it (perhaps incorrectly), the standalone gateway isn’t necessary if an alternative load balancer is in use.
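To illustrate what I mean, something along these lines would let us opt out. This is purely a hypothetical sketch, since the current chart doesn’t expose such a flag:

# hypothetical values.yaml entry (not in the chart today)
gateway:
  enabled: false

# hypothetical guard wrapping zeebe-cluster/templates/gateway-deployment.yaml
{{- if .Values.gateway.enabled }}
  ... existing gateway Deployment manifest ...
{{- end }}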

I know there are performance gains to be had from using a standalone gateway (see Standalone vs Embedded Gateway - #2 by jwulf - Zeebe Broker - Camunda Cloud, powered by the Zeebe Engine), but it does add complexity to the release. Are there benefits to using the standalone gateway now that make the added complexity worthwhile?

Thanks
Andy

@andyh yes, we recently made the standalone gateway the default topology for the Helm charts. In the beginning we used the embedded gateway because it was simpler, but that was mainly for development purposes.
Can you please share with us what kind of problems you are having? I would rather make the charts support Istio-enabled deployments than recommend that people use the embedded gateway.

Hi @salaboy

I’m not sure if this is an issue with the latest helm chart, which is why I haven’t raised an issue on GitHub. If it turns out to be a deployment issue I’d be happy to raise an issue there.

Since deploying zeebe-cluster-helm chart version 0.0.88, the broker bootstrap process takes an excessive amount of time to complete: over 20 minutes.

Looking at the zeebe-cluster logs I can see that the Bootstrap Broker-0 [6/10]: cluster services step took 1280707 ms (about 21 minutes).

logs

2020-04-14 10:49:53.525 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [3/10]: command api transport started in 416 ms
2020-04-14 10:49:53.525 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 [4/10]: command api handler
2020-04-14 10:49:53.617 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [4/10]: command api handler started in 91 ms
2020-04-14 10:49:53.618 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 [5/10]: subscription api
2020-04-14 10:49:53.810 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [5/10]: subscription api started in 192 ms
2020-04-14 10:49:53.811 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 [6/10]: cluster services
2020-04-14 11:11:14.518 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [6/10]: cluster services started in 1280707 ms
2020-04-14 11:11:14.519 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 [7/10]: topology manager
2020-04-14 11:11:14.520 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [7/10]: topology manager started in 1 ms
2020-04-14 11:11:14.521 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 [8/10]: metric's server
2020-04-14 11:11:14.531 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [8/10]: metric's server started in 10 ms
2020-04-14 11:11:14.532 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 [9/10]: leader management request handler
2020-04-14 11:11:14.533 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [9/10]: leader management request handler started in 1 ms
2020-04-14 11:11:14.534 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 [10/10]: zeebe partitions
2020-04-14 11:11:14.536 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 partitions [1/1]: partition 1
2020-04-14 11:11:14.881 [] [main] DEBUG io.zeebe.broker.exporter - Exporter configured with ElasticsearchExporterConfiguration{url='http://elasticsearch-master:9200', index=IndexConfiguration{indexPrefix='zeebe-record', createTemplate=true, command=false, event=true, rejection=false, error=true, deployment=true, incident=true, job=true, message=false, messageSubscription=false, variable=true, variableDocument=true, workflowInstance=true, workflowInstanceCreation=false, workflowInstanceSubscription=false}, bulk=BulkConfiguration{delay=5, size=1000}, authentication=AuthenticationConfiguration{username='null'}}
2020-04-14 11:11:15.051 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] DEBUG io.zeebe.broker.system - Removing follower partition service for partition PartitionId{id=1, group=raft-partition}
2020-04-14 11:11:15.115 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] DEBUG io.zeebe.broker.system - Partition role transitioning from null to LEADER
2020-04-14 11:11:15.115 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] DEBUG io.zeebe.broker.system - Installing leader partition service for partition PartitionId{id=1, group=raft-partition}
2020-04-14 11:11:15.532 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] DEBUG io.zeebe.logstreams.snapshot - Available snapshots: [SnapshotImpl{position=38655128624, path=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/6054-38-1586267129817-38655128624}, SnapshotImpl{position=38655109208, path=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/5997-38-1586266229810-38655109208}, SnapshotImpl{position=38655094760, path=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/5955-38-1586265329782-38655094760}]
2020-04-14 11:11:16.468 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] DEBUG io.zeebe.logstreams.snapshot - Opened database from '/usr/local/zeebe/data/raft-partition/partitions/1/runtime'.
2020-04-14 11:11:16.470 [Broker-0-ZeebePartition-1] [Broker-0-zb-actors-1] DEBUG io.zeebe.logstreams.snapshot - Recovered state from snapshot 'SnapshotImpl{position=38655128624, path=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/6054-38-1586267129817-38655128624}'
2020-04-14 11:11:16.722 [Broker-0-LogStream-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.logstreams - Configured log appender back pressure at partition 1 as AppenderVegasCfg{initialLimit=1024, maxConcurrency=32768, alphaLimit=0.7, betaLimit=0.95}. Window limiting is disabled
2020-04-14 11:11:16.871 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] DEBUG io.zeebe.logstreams - Recovering state of partition 1 from snapshot
2020-04-14 11:11:17.058 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] INFO  io.zeebe.logstreams - Recovered state of partition 1 from snapshot at position 38655128624
2020-04-14 11:11:17.852 [Broker-0-SnapshotDirector-1] [Broker-0-zb-actors-1] DEBUG io.zeebe.logstreams.snapshot - The position of the last valid snapshot is '38655128624'. Taking snapshots beyond this position.
2020-04-14 11:11:17.915 [Broker-0-Exporter-1] [Broker-0-zb-fs-workers-1] DEBUG io.zeebe.broker.exporter - Recovering exporter from snapshot
2020-04-14 11:11:17.920 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 partitions [1/1]: partition 1 started in 3383 ms
2020-04-14 11:11:17.920 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 partitions succeeded. Started 1 steps in 3384 ms.
2020-04-14 11:11:17.920 [] [main] DEBUG io.zeebe.broker.system - Bootstrap Broker-0 [10/10]: zeebe partitions started in 3386 ms
2020-04-14 11:11:17.920 [] [main] INFO  io.zeebe.broker.system - Bootstrap Broker-0 succeeded. Started 10 steps in 1292283 ms.
2020-04-14 11:11:17.924 [Broker-0-HealthCheckService] [Broker-0-zb-actors-1] DEBUG io.zeebe.broker.system - All partitions are installed. Broker is ready!
2020-04-14 11:11:18.211 [Broker-0-Exporter-1] [Broker-0-zb-fs-workers-1] DEBUG io.zeebe.broker.exporter - Recovered exporter 'Broker-0-Exporter-1' from snapshot at lastExportedPosition 38655128624
2020-04-14 11:11:18.212 [Broker-0-Exporter-1] [Broker-0-zb-fs-workers-1] DEBUG io.zeebe.broker.exporter - Configure exporter with id 'elasticsearch'
2020-04-14 11:11:18.212 [Broker-0-Exporter-1] [Broker-0-zb-fs-workers-1] DEBUG io.zeebe.broker.exporter.elasticsearch - Exporter configured with ElasticsearchExporterConfiguration{url='http://elasticsearch-master:9200', index=IndexConfiguration{indexPrefix='zeebe-record', createTemplate=true, command=false, event=true, rejection=false, error=true, deployment=true, incident=true, job=true, message=false, messageSubscription=false, variable=true, variableDocument=true, workflowInstance=true, workflowInstanceCreation=false, workflowInstanceSubscription=false}, bulk=BulkConfiguration{delay=5, size=1000}, authentication=AuthenticationConfiguration{username='null'}}

I’m currently not deploying Zeebe with the Hazelcast exporter, to make the upgrade easier. Could this be the issue?

dev config:

{
  "brokers": [
    {
      "partitions": [
        {
          "partitionId": 1,
          "role": "LEADER"
        }
      ],
      "nodeId": 0,
      "host": "workflow-engine-zeebe-0.workflow-engine-zeebe.dev.svc.cluster.local",
      "port": 26501
    }
  ],
  "clusterSize": 1,
  "partitionsCount": 1,
  "replicationFactor": 1
}

ElasticSearch:
replicas: 1

Hey @andyh,

this is expected if you already have data on your PVC.
Is this the case?

Greets
Chris


@Zelldon thanks for jumping in… that is a very good question…

Hi @Zelldon

Yes that’s right, there is data on the pvc.

I wouldn’t have expected there to be a lot of data, because the number of workflows we have deployed and executed is quite low and we have always exported the data to Elasticsearch.

The environment I am using is a development cluster, so the data (at least for now) isn’t important, and if this bug is only a short-term issue I could delete the data on the disk to speed up the broker bootstrap.

I could delete and recreate the storage disks, but is there a (slightly) more elegant way to remove data e.g. could I kubectl exec onto the broker and delete the data?

Thanks
Andy

Hey @andyh,

thanks for your response. Yeah, it also depends on the snapshot interval and count, and on how long the cluster has been running. Which version do you use? I think I missed that part.

I could delete and recreate the storage disks, but is there a (slightly) more elegant way to remove data e.g. could I kubectl exec onto the broker and delete the data?

When you remove your helm release, you can also just remove the PVC via kubectl delete persistentvolumeclaims -l app=RELEASE_NAME-zeebe. Do you mean something like that?
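For example, adjusting the release name to yours and double-checking the selector actually matches your labels before deleting anything:

helm delete RELEASE_NAME
kubectl get pvc -l app=RELEASE_NAME-zeebe      # check which claims would be removed
kubectl delete pvc -l app=RELEASE_NAME-zeebe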

Greets
Chris

Hi @Zelldon

We are using Zeebe Cluster version 0.22.1. The snapshot configuration is the default, which I believe is 3 snapshots every 15 minutes. Should we configure snapshots to be less frequent?

In our environment we have statically provisioned Azure Managed Disks to support disk replication, so we cannot delete the disk by executing the kubectl delete pvc xyz command. If there is a root directory that contains the data which could safely be removed, that approach would be an easier temporary fix.

It is worth mentioning that I have experimented with Zeebe cluster version 0.22.2, but the current helm chart (version 0.0.88) uses 0.22.1, so for now I have stuck with that version.

Thanks for your help,
Andy

Hey @andyh

We are using Zeebe Cluster version 0.22.1. The snapshot configuration is the default, which I believe is 3 snapshots every 15 minutes. Should we configure snapshots to be less frequent?

You can set the max snapshot count to one; we removed this config in the latest version. This will reduce the amount of data that is kept.
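For 0.22.x that is the [data] section in zeebe.cfg.toml. Roughly like this, from memory, so please double-check against the config your chart actually ships:

[data]
# how often a snapshot of the state is taken
snapshotPeriod = "15m"
# how many snapshots are kept on disk; 1 keeps only the latest and reduces the stored data
maxSnapshots = 1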

In our environment we have statically provisioned Azure Managed Disks to support disk replication, so we cannot delete the disk by executing the kubectl delete pvc xyz command. If there is a root directory that contains the data which could safely be removed, that approach would be an easier temporary fix.

I think when the brokers are stopped you can delete the data/ folder; that is where the data is stored :slightly_smiling_face:. After deleting the folder you can start the broker again and it should have a clean state.
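If you want to go the kubectl exec route you mentioned, a rough sketch would look like this. The pod name is taken from the topology output you posted and the mount path from your logs, so adjust both if they differ; note this deletes the data out from under a running broker, so only do it when nothing important is running:

kubectl exec -it workflow-engine-zeebe-0 -- sh -c 'rm -rf /usr/local/zeebe/data/*'
# delete the pod so the StatefulSet recreates it and the broker bootstraps with a clean state
kubectl delete pod workflow-engine-zeebe-0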

It is worth mentioning that I have experimented with Zeebe cluster version 0.22.2, but the current helm chart (version 0.0.88) uses 0.22.1, so for now I have stuck with that version.

Thanks for reporting, good to know. @salaboy we should probably update this, but we can also just go to the next minor.

Greets
Chris

Hi @Zelldon

Thanks for informing me how to delete data from the broker. I did attempt to delete the data as instructed but it was starting to become more awkward than just deleting the disk, so I opted for the path of least resistance and recreated the broker disk. :slight_smile:

After deleting the data the broker pod was ready within a few minutes and I could deploy workflow definitions and create instances using the zbctl.

I have also managed to get Simple Monitor deployed. Before I deleted the data, the Hazelcast exporter seemed to be working based on the logs, but when I attempted to deploy a workflow definition using Simple Monitor the request simply timed out (it exceeded 15 seconds, which I think is the default timeout?).

After deleting the data I was able to get Simple Monitor to deploy workflow definitions, create instances and complete the workflows by sending messages and completing jobs.

So things seem to be working ok, but I do have a couple of questions:

  1. I noticed that if I deploy a workflow definition (BPMN XML file) via zbctl, which makes its request via the standalone gateway, the definition is created; if I resend the exact same definition, the version is not incremented and stays at version 1 (expected behaviour). The same is not true when using Simple Monitor: it creates a new version regardless. Why is that, and is it expected behaviour?

  2. In Kubernetes the standalone gateway exposes port 26500, which used to be exposed by the Zeebe service. To get Simple Monitor working I had to change the environment variable “io.zeebe.monitor.connectionString” to point to the gateway, which makes sense. I also attempted to expose port 5701 (the Hazelcast port) on the gateway, but that didn’t seem to work, so I have left port 5701 exposed via the Zeebe service (and of course the broker container). This means that the Simple Monitor environment variable “io.zeebe.monitor.hazelcast.connection” is pointing at the Zeebe service (not the gateway). Does this seem correct, or should I be exposing port 5701 via the gateway service? My current setup is sketched below.
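This is roughly how my Simple Monitor deployment is wired up at the moment (the service names are placeholders based on my release, and the gateway service name in particular may not match the chart's naming):

env:
  # gRPC traffic goes through the standalone gateway service
  - name: io.zeebe.monitor.connectionString
    value: "workflow-engine-zeebe-gateway:26500"
  # Hazelcast is still exposed on the Zeebe (broker) service, not the gateway
  - name: io.zeebe.monitor.hazelcast.connection
    value: "workflow-engine-zeebe:5701"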

Thanks
Andy