Memory profile for Zeebe brokers

Broker version: 0.24.3
We run Zeebe on Kubernetes with 3 brokers and a gateway (deployed via the Helm chart), with the Elasticsearch exporter enabled.

Prior to 0.24.3 we experienced a lot of problems with our Zeebe setup, where brokers would restart and not come back for a long time (days). We believe this is related to https://github.com/zeebe-io/zeebe/issues/5135.

We think the restarts were caused by Kubernetes killing the pods due to memory usage (4 GiB limit), and that on restart the brokers hit the issue above, leaving the system for all practical purposes down. This has been reported in the Zeebe Slack channel earlier.
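(For anyone wanting to verify the same thing: the OOM kill can be confirmed from the pod's last termination state. A sketch - the pod name and namespace here are placeholders for whatever your Helm release produces:)

```
# Show why the previous container instance terminated;
# an OOM kill appears as "Reason: OOMKilled" under "Last State".
kubectl describe pod zeebe-zeebe-0 -n zeebe | grep -A 5 "Last State"

# Or extract just the termination reason via JSONPath:
kubectl get pod zeebe-zeebe-0 -n zeebe \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```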

With 0.24.3 we saw restarts being quick, but we don't know for certain how the system will respond once the pods are restarted by Kubernetes - which they will be soon, given the memory usage. Here is the Grafana output for one of the brokers:

Some snapshots from the cluster:

Current resource usage:

Uptime:

We are currently running one workflow, with approximately 500-1000 instances per 24 hours.

The issue looks similar to: Zeebe broker 0.20.0 memory usage is too high
However, given its age, I would have hoped it had been addressed by now, considering the severity.

I will update this post once the pods are restarted and describe the behaviour we see (does the broker come back quickly, or does it cause downtime like before).

Cheers,
Lars

Update 30/09

I realise the original post was a little difficult to interpret - what was the question being posed?
The tl;dr: we had suffered the issue where brokers wouldn't come online after being restarted, which was reported fixed in 0.24.3. Even after upgrading to 0.24.3 we noticed that memory usage for the brokers was always climbing, and we feared that once a pod was deleted due to resource limits, the same issue would occur.

So - what have we observed:

We deleted the main broker (broker-1, the one with the highest resource usage) instead of waiting for Kubernetes to kill the pod. This was to ensure we could watch the restart. The really, really REALLY great news is that the broker was back up in less than a minute and there was no downtime.
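(For reference, this was just a plain pod deletion - the pod name is a placeholder for whatever your release is called:)

```
# Delete the broker pod manually so the restart can be observed,
# rather than waiting for Kubernetes to OOM-kill it.
kubectl delete pod zeebe-zeebe-1 -n zeebe
```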

While the memory usage is certainly less than desirable, the fact that the system can restore itself quickly means the impact is negligible.


Thanks for updating @LarsMadMan

Might be related to one of these: https://github.com/zeebe-io/zeebe/issues/4812 or https://github.com/zeebe-io/zeebe/issues/3988

Greets
Chris


Thanks Chris - those look relevant, but it's hard to say for a layman :slight_smile: . We don't have long-running workflows; they typically run from start to finish in a few seconds. It looks like the issues are neither resolved nor currently being worked on - are there any plans to reopen them? From my point of view that would be very desirable. Should we potentially adjust the memory limits in Kubernetes so that pods are restarted before they reach 4 GiB? I mean to avoid performance degradation when memory usage gets very high.
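In case it helps frame the question, this is roughly what I have in mind - a sketch assuming the zeebe-cluster Helm chart exposes a standard `resources` block (the exact value paths may differ between chart versions, so check the chart's values.yaml):

```
# Lower the broker memory limit so Kubernetes restarts pods
# earlier, before memory usage degrades performance.
# "zeebe" is our release name; the value paths are assumptions
# based on a standard "resources" block in the chart.
helm upgrade zeebe zeebe/zeebe-cluster \
  --set resources.limits.memory=3Gi \
  --set resources.requests.memory=2Gi
```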

Regards,
Lars