Zeebe Cluster 0.23.1 hangs on "cluster services"

I am trying to deploy a Zeebe cluster (with a standalone gateway) on Kubernetes, but the brokers hang after logging this:
INFO io.zeebe.broker.system - Bootstrap Broker-0 [6/10]: cluster services
Unfortunately, this happens on all nodes.
When I query the cluster status with zbctl through the gateway, I get:

Cluster size: 0
Partitions count: 0
Replication factor: 0
Gateway version: 0.23.1
Brokers:

Broker configuration:

# In ConfigMap zeebe-cluster
# key: application.yml
zeebe:
    broker:
        gateway:
            enable: false
        cluster:
            clusterName: zeebe-cluster
---
# Kubernetes statefulset configuration file
apiVersion: apps/v1
kind: StatefulSet

metadata:
    name: zeebe-cluster

spec:
    selector:
        matchLabels:
            app: zeebe
            name: zeebe
    serviceName: zeebe
    replicas: 3
    updateStrategy:
        type: RollingUpdate
    podManagementPolicy: Parallel
    template:
        metadata:
            labels:
                app: zeebe
                name: zeebe
        spec:
            terminationGracePeriodSeconds: 10
            containers:
                -   name: zeebe
                    image: camunda/zeebe:0.23.1
                    env:
                        -   name: ZEEBE_LOG_LEVEL
                            value: debug
                        -   name: ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT
                            value: "2" # also tried with 3
                        -   name: ZEEBE_BROKER_CLUSTER_CLUSTERSIZE
                            value: "3"
                        -   name: ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR
                            value: "3"
                        -   name: ZEEBE_BROKER_GATEWAY_NETWORK_PORT
                            value: "26500"
                    ports:
                        -   containerPort: 26500
                            name: gateway
                        -   containerPort: 26501
                            name: command
                        -   containerPort: 26502
                            name: internal
                        -   containerPort: 9600
                            name: metrics
                    resources:
                        requests:
                            cpu: 500m
                            memory: 2G
                        limits:
                            cpu: 1000m
                            memory: 2G
                    volumeMounts:
                        -   name: config
                            mountPath: /usr/local/zeebe/config
                        -   name: startup
                            mountPath: /usr/local/bin
                        -   name: data
                            mountPath: /usr/local/zeebe/data
            volumes:
                -   name: config
                    configMap:
                        name: zeebe-config
                        defaultMode: 0777
                -   name: startup
                    configMap:
                        name: zeebe-startup-broker
                        defaultMode: 0777
    volumeClaimTemplates:
        -   metadata:
                name: data
            spec:
                accessModes: [ "ReadWriteOnce" ]
                storageClassName: basic
                resources:
                    requests:
                        storage: 5Gi


Startup ConfigMap:
key: startup.sh

#!/bin/bash -xeu
export ZEEBE_BROKER_CLUSTER_NODEID="${HOSTNAME##*-}"
ZEEBE_NODE_ID=$ZEEBE_BROKER_CLUSTER_NODEID

# all broker contact points
ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS="${HOSTNAME}:26502,zeebe-gateway:26502"
for (( i=0; i<$ZEEBE_BROKER_CLUSTER_CLUSTERSIZE; i++ ))
do
    if [ "${ZEEBE_NODE_ID}" != "${i}" ]; then
        ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS="${ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS},${HOSTNAME:0:(-1)}${i}.zeebe.default.svc.cluster.local:26502"
    fi
done
export ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS="${ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS}"

# Normal startup
if [ "$ZEEBE_STANDALONE_GATEWAY" = "true" ]; then
    export ZEEBE_GATEWAY_CLUSTER_HOST="${ZEEBE_GATEWAY_CLUSTER_HOST:-${ZEEBE_HOSTS}}"
    exec /usr/local/zeebe/bin/gateway
else
    exec /usr/local/zeebe/bin/broker
fi
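
For reference, the contact-point loop in the script above can be tried outside the cluster. This is a minimal sketch, assuming pod hostnames of the form zeebe-cluster-N and the service name zeebe in the default namespace, as in the StatefulSet above. The function name build_contact_points is hypothetical and not part of Zeebe. Unlike the original loop, it strips the ordinal with `%-*` rather than removing a single character, and it lists only the other brokers.

```shell
#!/bin/bash
# Hypothetical helper: prints the comma-separated peer list for one broker pod.
# $1 = pod hostname (e.g. zeebe-cluster-1), $2 = cluster size
build_contact_points() {
    local hostname="$1" size="$2"
    local node_id="${hostname##*-}"   # ordinal after the last dash
    local prefix="${hostname%-*}"     # StatefulSet name without the ordinal
    local points="" i
    for (( i = 0; i < size; i++ )); do
        # Skip the broker's own ordinal; append every other broker.
        if [ "${node_id}" != "${i}" ]; then
            points="${points:+${points},}${prefix}-${i}.zeebe.default.svc.cluster.local:26502"
        fi
    done
    echo "${points}"
}

build_contact_points "zeebe-cluster-1" 3
```

For zeebe-cluster-1 in a cluster of size 3, this lists the addresses of zeebe-cluster-0 and zeebe-cluster-2 on port 26502.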

Gateway configuration:

zeebe:
    gateway:
        cluster:
            contactPoint: zeebe-cluster-0.zeebe.default.svc.cluster.local:26502
        monitoring:
             enabled: true
             port: 9600

Any help would be greatly appreciated!

@zerbrinsky have you tried the Helm Charts?
That may be the best place to look: the charts are already working, and people are reporting issues there.

Unfortunately the issue also persists when using helm…
Except now I also get the following warning:
[raft-server-0-system-partition-1] WARN io.atomix.raft.roles.FollowerRole - RaftServer{system-partition-1}{role=FOLLOWER} - java.net.ConnectException
I should also mention that I am trying to run this on OpenShift 3.9.

I just tried the camunda/zeebe:0.23.1-non-root image and still see the same problem.

Can you use gist.github.com to share the logs from the broker? While brokers are starting it is ok to get some warnings… but they should go away when all the brokers are up.
The more that you can share the better.

logs:
zeebe-broker-logs

@zerbrinsky there is no error there that I can see… what exactly is not working?

It just hangs there forever, even 10 hours after initial pod creation

When I deploy a single broker it works fine

For some reason, setting the replication factor to 1 seems to do the trick

But with either 2 or 3 I encounter the same issue
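
One possible explanation for this difference, sketched under the assumption that partition replication follows the usual Raft majority rule (the log line above comes from io.atomix.raft): with a replication factor of 1 a broker is its own quorum, while with 2 or 3 it has to reach at least one peer before a partition can elect a leader. The function name quorum is illustrative only.

```shell
#!/bin/bash
# Majority quorum for a Raft group of the given size: floor(n / 2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Show how many reachable peers each replication factor requires.
for rf in 1 2 3; do
    q="$(quorum "${rf}")"
    echo "replication factor ${rf}: quorum ${q}, peers needed $(( q - 1 ))"
done
```

If the brokers cannot connect to each other at all, only the replication factor 1 case can make progress, which would match the behaviour described here.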

@zerbrinsky interesting… I am deploying the same chart with 3 brokers and it is all fine… it might be that you are hitting some kind of limit in your Kubernetes cluster… that would explain it… describe the pods and you might find the reason in the “Events” section.

I don’t see any unusual events…
Maybe it could be related to the standalone gateway?

@zerbrinsky take a look at the logs in the gateway… or feel free to join the zeebe-k8s slack channel to see if we can help you out to get things working.

Will definitely do that, thanks for the suggestion!

I have just tried to deploy the zeebe cluster with the same configuration on GKE, with the same result, which leads me to believe I have a configuration error somewhere.
Gateway logs and configuration: https://gist.github.com/zerbrinsky/a8f052b58463a70584446150b1884333

Managed to get it working on GKE!
I had to add this to startup.sh:

#########################
# Normal startup
if [ "$ZEEBE_STANDALONE_GATEWAY" = "true" ]; then
    export ZEEBE_GATEWAY_CLUSTER_HOST="${ZEEBE_GATEWAY_CLUSTER_HOST:-${ZEEBE_NODE_ID}}"
    exec /usr/local/zeebe/bin/gateway
else
    # This line here!
    export ZEEBE_BROKER_NETWORK_HOST="zeebe-cluster-${ZEEBE_NODE_ID}.zeebe.default.svc.cluster.local" 
    
    exec /usr/local/zeebe/bin/broker
fi

This changed the advertisedHost from 0.0.0.0 (which I thought would be correct) to the broker's actual DNS name in the Kubernetes cluster, which the other brokers can resolve and connect to.
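
To illustrate the change, here is a minimal sketch of how a per-pod address can be derived from the pod hostname. It assumes the StatefulSet name zeebe-cluster, the service name zeebe, the default namespace, and the default cluster domain cluster.local, as used in this thread. The helper advertised_host is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: maps a StatefulSet pod hostname to its stable DNS name
# (<pod>.<service>.<namespace>.svc.cluster.local).
# $1 = pod hostname, $2 = service (default: zeebe), $3 = namespace (default: default)
advertised_host() {
    local hostname="$1" service="${2:-zeebe}" namespace="${3:-default}"
    echo "${hostname}.${service}.${namespace}.svc.cluster.local"
}

# Inside a broker pod you would pass "${HOSTNAME}"; a fixed name is used here
# so the sketch also runs outside the cluster.
ZEEBE_BROKER_NETWORK_HOST="$(advertised_host "zeebe-cluster-0")"
export ZEEBE_BROKER_NETWORK_HOST
echo "${ZEEBE_BROKER_NETWORK_HOST}"
```

The point of the design is that 0.0.0.0 only makes sense as a local bind address. Peers need a name they can dial, and the per-pod DNS name stays the same across restarts.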
This Sunday I will test this on OpenShift and we’ll see how that goes.

Update: it is also working on OpenShift!

@zerbrinsky can you keep me posted about your changes? It would be good to add them to the charts, so you don’t need to customise anything… can you provide a PR to the zeebe-cluster-helm chart?

Cheers