How do I tune Zeebe 0.25.1 to reduce memory usage?

Andy Rusterholz: In our system we are giving the brokers 5GB (5120MiB, to be exact). We raised this from 4GB because we frequently saw Zeebe running out of memory, but with 5GB it still slowly grows and grows until it runs out of memory and gets killed by Kubernetes. Our operations team is eager for me to find a way to resolve this behavior.

Andy Rusterholz: I can pretty well guarantee that we are not under high load. We run one synthetic workflow every 5 minutes just to be sure that Zeebe is still responding, but other than that, we start maybe 1-5 workflows per day.

Andy Rusterholz: The weirdest thing, to me, is that we didn’t see any change at all in the memory consumption behavior when we applied the 8MB/16MB configuration values above, even though this is 1/8th of the default values. Does this suggest that Zeebe’s memory consumption patterns might not be dependent on the RocksDB column families?
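
For reference, the 8MB/16MB values being discussed map to RocksDB's per-column-family write-buffer settings. Below is a minimal sketch of the knobs involved, using the plain RocksJava API - how Zeebe 0.25 wires these options internally is not shown in this thread, so treat the class and values as illustrative only:

import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.RocksDB;

public class WriteBufferSketch {
    public static void main(String[] args) {
        RocksDB.loadLibrary(); // the native library must be loaded before creating options

        // Each column family keeps its own memtables (write buffers) in memory.
        // The 8MB/16MB experiment described above shrinks exactly these knobs.
        try (ColumnFamilyOptions cfOptions = new ColumnFamilyOptions()
                .setWriteBufferSize(8L * 1024 * 1024)   // size at which a single memtable counts as full
                .setMaxWriteBufferNumber(2)             // memtables a column family may keep in memory
                .setMinWriteBufferNumberToMerge(1)) {   // flush as soon as one memtable fills up
            System.out.println("write buffer size: " + cfOptions.writeBufferSize());
        }
    }
}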

Lars Madsen: My gut feeling on this would be a memory leak somewhere. Our system has limited utilisation: 1 workflow started per minute, each taking about 10 seconds to complete, so the active load on the system is almost always zero. There does not seem to be any reason why the broker should use anywhere near this amount of resources, seeing how it starts up at 600 MB and then slowly bloats to 4 GB.

Nicolas: A couple of things here. First, when running it as a container, it's normal to some extent that the RSS (and more so the WSS) will tend to reach the max. This is even more apparent in 0.26, where we use memory-mapped files more extensively and the OS will use as much as it can for page cache. That said, it shouldn't lead to OOM kills, of course.

Second, regarding where the issue is, the profiling issue linked above documents different runs we did with jemalloc and tcmalloc, with allocation profiling turned on (I also did a run with massif, but that was a pain to get working with the JVM). Those runs seem to point to RocksDB as the main culprit over time, even with native memory tracking enabled in the JVM.

Finally, the one-column-family prototype we did also documents different runs which were very stable memory-wise, both with stable and preemptible nodes. The WSS would reach the limit and hover there, as expected, with the page cache being cleared out as need be.

Of course none of this proves the absence of memory leaks, but I hope it explains why we believe the main issue to be with our (somewhat wrong) usage of RocksDB in 0.25/0.26

Lars Madsen: I see - thanks for your explanation. I will leave the brokers running, to see if we experience failures when starting new workflows, or if the errors only affect broker internals.

Lars Madsen: @Nicolas you mentioned JVM settings, should I adjust those? We've left them at the defaults. So message size we've set to 128, column family options as above, and the Kubernetes memory limit is 4 GB.
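
As a side note, a quick way to see what the default JVM settings actually resolve to inside the container is to print the effective max heap; on modern JDKs with container support the default is typically 25% of the container memory limit, so most of a 4GB limit goes to native memory (RocksDB, mapped files, etc.) rather than the heap. This is a generic illustration, not a Zeebe-specific tool:

public class HeapCheckSketch {
    public static void main(String[] args) {
        // With default flags in a container, the max heap is derived from the container
        // memory limit (typically 25%), so the heap is only a fraction of what the
        // broker process is allowed to use overall.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("max heap: %d MB%n", maxHeapBytes / (1024 * 1024));
    }
}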

Lars Madsen: Here is what I am seeing now that the active broker is running close to its limits:

2021-01-21 08:36:55.194 [] [raft-server-1-raft-partition-partition-2] WARN  io.atomix.raft.roles.LeaderAppender - RaftServer{raft-partition-partition-2} - AppendRequest{term=59, leader=1, prevLogIndex=11857299, prevLogTerm=59, entries=9, commitIndex=11857299} to 0 failed: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Request type raft-partition-partition-2-append timed out in 5000 milliseconds
[zeebe-cluster-zeebe-1 zeebe-cluster]  2021-01-21 08:36:55.199 [] [raft-server-1-raft-partition-partition-3] WARN  io.atomix.raft.roles.LeaderAppender - RaftServer{raft-partition-partition-3} - AppendRequest{term=55, leader=1, prevLogIndex=11955515, prevLogTerm=55, entries=7, commitIndex=11955516} to 0 failed: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Request type raft-partition-partition-3-append timed out in 5000 milliseconds

Lars Madsen: The other brokers are logging:

[zeebe-cluster-zeebe-0 zeebe-cluster]  2021-01-20 10:08:35.287 [] [raft-server-0-raft-partition-partition-3] WARN  io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-3}{role=FOLLOWER} - Poll request to 1 failed: java.util.concurrent.TimeoutException: Request type raft-partition-partition-3-poll timed out in 5000 milliseconds
[zeebe-cluster-zeebe-0 zeebe-cluster]  2021-01-20 10:08:38.451 [] [raft-server-0-raft-partition-partition-3] WARN  io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-3}{role=FOLLOWER} - Poll request to 2 failed: java.net.ConnectException: Expected to send a message with subject 'raft-partition-partition-3-poll' to member '2', but member is not known. Known members are '[Member{id=1, address=zeebe-cluster-zeebe-1.zeebe-cluster-zeebe.zeebe.svc.cluster.local:26502, properties={brokerInfo=EADJAAAAAwABAAAAAwAAAAMAAAADAAAAAAABCgAAAGNvbW1hbmRBcGlHAAAAemVlYmUtY2x1c3Rlci16ZWViZS0xLnplZWJlLWNsdXN0ZXItemVlYmUuemVlYmUuc3ZjLmNsdXN0ZXIubG9jYWw6MjY1MDEFAAMBAAAAAQIAAAAAAwAAAAEMAAECAAAAOwAAAAAAAAAGAAAAMC4yNi4wBQADAQAAAAECAAAAAQMAAAAB}}, Member{id=zeebe-cluster-zeebe-gateway-6f457bd79f-8n8rb, address=10.208.9.74:26502, properties={event-service-topics-subscribed=Af8fAQEDAWpvYnNBdmFpbGFibOU=}}, Member{id=0, address=zeebe-cluster-zeebe-0.zeebe-cluster-zeebe.zeebe.svc.cluster.local:26502, properties={brokerInfo=EADJAAAAAwAAAAAAAwAAAAMAAAADAAAAAAABCgAAAGNvbW1hbmRBcGlHAAAAemVlYmUtY2x1c3Rlci16ZWViZS0wLnplZWJlLWNsdXN0ZXItemVlYmUuemVlYmUuc3ZjLmNsdXN0ZXIubG9jYWw6MjY1MDEFAAMBAAAAAQIAAAABAwAAAAEMAAAGAAAAMC4yNi4wBQADAQAAAAECAAAAAQMAAAAB}}]'.
[zeebe-cluster-zeebe-0 zeebe-cluster]  2021-01-20 10:08:43.451 [] [raft-server-0-raft-partition-partition-3] WARN  io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-3}{role=FOLLOWER} - Poll request to 1 failed: java.util.concurrent.TimeoutException: Request type raft-partition-partition-3-poll timed out in 5000 milliseconds
[zeebe-cluster-zeebe-0 zeebe-cluster]  2021-01-20 10:08:48.127 [] [raft-server-0-raft-partition-partition-3] WARN  io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-3}{role=FOLLOWER} - Poll request to 2 failed: java.net.ConnectException: Expected to send a message with subject 'raft-partition-partition-3-poll' to member '2', but member is not known. Known members are '[Member{id=1, address=zeebe-cluster-zeebe-1.zeebe-cluster-zeebe.zeebe.svc.cluster.local:26502, properties={brokerInfo=EADJAAAAAwABAAAAAwAAAAMAAAADAAAAAAABCgAAAGNvbW1hbmRBcGlHAAAAemVlYmUtY2x1c3Rlci16ZWViZS0xLnplZWJlLWNsdXN0ZXItemVlYmUuemVlYmUuc3ZjLmNsdXN0ZXIubG9jYWw6MjY1MDEFAAMBAAAAAQIAAAAAAwAAAAEMAAECAAAAOwAAAAAAAAAGAAAAMC4yNi4wBQADAQAAAAECAAAAAQMAAAAB}}, Member{id=zeebe-cluster-zeebe-gateway-6f457bd79f-8n8rb, address=10.208.9.74:26502, properties={event-service-topics-subscribed=Af8fAQEDAWpvYnNBdmFpbGFibOU=}}, Member{id=0, address=zeebe-cluster-zeebe-0.zeebe-cluster-zeebe.zeebe.svc.cluster.local:26502, properties={brokerInfo=EADJAAAAAwAAAAAAAwAAAAMAAAADAAAAAAABCgAAAGNvbW1hbmRBcGlHAAAAemVlYmUtY2x1c3Rlci16ZWViZS0wLnplZWJlLWNsdXN0ZXItemVlYmUuemVlYmUuc3ZjLmNsdXN0ZXIubG9jYWw6MjY1MDEFAAMBAAAAAQIAAAABAwAAAAEMAAAGAAAAMC4yNi4wBQADAQAAAAECAAAAAQMAAAAB}}]'.
[zeebe-cluster-zeebe-0 zeebe-cluster]  2021-01-20 10:27:11.255 [] [raft-server-0-raft-partition-partition-1] WARN  io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Poll request to 2 failed: java.util.concurrent.TimeoutException: Request type raft-partition-partition-1-poll timed out in 5000 milliseconds
[zeebe-cluster-zeebe-0 zeebe-cluster]  2021-01-20 10:27:19.621 [] [raft-server-0-raft-partition-partition-1] WARN  io.atomix.raft.roles.FollowerRole - RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Poll request to 2 failed: java.util.concurrent.TimeoutException: Request type raft-partition-partition-1-poll timed out in 5000 milliseconds

Lars Madsen: In our system, kubectl top returns

zeebe-cluster-zeebe-0                          46m          641Mi
zeebe-cluster-zeebe-1                          91m          3754Mi
zeebe-cluster-zeebe-2                          82m          732Mi

Basically, broker 1 took over from broker 2 when I restarted broker 2 yesterday. The system looks to be responding to starting new workflows, but the logging is very worrying.

Nicolas: That does sound like a memory leak, you're right - it makes sense that as memory becomes scarce things get slower and you start getting timeouts (which will lead to elections, causing failover and small windows of unavailability). We've run high-load tests fairly stably for weeks by overprovisioning memory for the moment (that is, until we have the fix I linked to above) - something like 3Gi per partition. It's quite a lot, but it seems to be working, and we're fairly optimistic about reducing it in the next release. So for 3 partitions, 3 replicas, we run something like:

requests:
  cpu: 5
  memory: 9Gi
limits:
  cpu: 6
  memory: 12Gi

That seems stable enough, but we run on preemptible nodes (cheaper nodes which get restarted at least once a day by Google), so we wouldn't exactly notice a slow memory leak. We've run load tests on stable nodes as well, and we did observe that memory was ballooning slowly, and linked it to the RocksDB usage - you can find that in the profiling issue I linked above as well. There the settings @Deepthi found seemed to have brought that under control, iirc. Setting max_open_files to a lower number also seemed to help, I think. The issue probably has more info on different settings than I've mentioned here; I'd need to re-read it to know what would be a good workaround for you until the next release.
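
The max_open_files knob mentioned here lives on RocksDB's DBOptions. A minimal sketch using plain RocksJava follows - whether and how Zeebe 0.25/0.26 exposes this to users is not covered in this thread, and 256 is purely an example value:

import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;

public class MaxOpenFilesSketch {
    public static void main(String[] args) {
        RocksDB.loadLibrary();

        // The RocksDB default of -1 keeps every SST file open, with its index and
        // filter blocks held in memory, which adds up when thousands of SST files
        // pile up. A finite bound forces RocksDB to evict table readers instead.
        try (DBOptions dbOptions = new DBOptions().setMaxOpenFiles(256)) {
            System.out.println("max open files: " + dbOptions.maxOpenFiles());
        }
    }
}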

Lars Madsen: The interesting thing is that the broker 0 logs seem to indicate that it can't talk to broker 2, which was restarted yesterday but today is running at around 800 MB with no traffic in the logs. I would have thought the unavailable broker would be the one running near its limit (broker 1).

Lars Madsen: The system is up and processing workflows

Lars Madsen: Actually, please disregard the latest message - the log from broker 0 is not from today, but from when broker 2 was restarted, so it makes sense that it couldn't talk to it :smile:

Lars Madsen: But there must be something missing in my understanding of Zeebe. Once a workflow is completed, all events regarding it are impossible to get out of the brokers, and they are only accessible if exported (using something like Operate to view the Elasticsearch storage). As such, there would be limited reasons why they should still exist in the broker's memory. I guess it's because deleting obsolete records and/or compacting databases like RocksDB is expensive performance-wise that it's not done more aggressively in Zeebe? Because it does seem a bit wasteful to keep storing large amounts of data that can't even be accessed.

Nicolas: I think it might help to understand how RocksDB allocates memory. Each column family (think of them as tables in a database, to some extent) will keep a certain number of memtables (or write buffers) around. At any given point, there's only one "active" memtable - that is, only one you can write new entries to. Once a memtable is full, RocksDB will create a new one and schedule the full ones for flushing to disk. However, this is asynchronous, so your memory usage may temporarily be greater. Furthermore, to avoid doing too much I/O, it will try to batch the memtables, waiting until it has a minimum number of them that it can merge into a single SST file.

So let's say each column family will maintain up to 6 memtables, each of which needs 16MB to be full - then each column family will use up to 6 * 16MB. If you're under low load, it will still keep some of these memtables around for a while before flushing them. When taking a snapshot (or checkpoint, in RocksDB lingo), they will be flushed, which may cause smaller-than-expected SST files.
Then, with low load, you run into another issue - since your flushes tend to be smaller, you end up with many small SST files that take longer for compaction to merge and remove.
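
To make that arithmetic concrete, here is a rough worst-case estimate of memtable memory. The column family count of 50 is purely an assumption for illustration; the actual number in 0.25/0.26 is not stated in this thread:

public class MemtableMathSketch {
    public static void main(String[] args) {
        long columnFamilies = 50;                    // assumption, for illustration only
        long memtablesPerColumnFamily = 6;           // "up to 6 memtables" from the example above
        long memtableSizeBytes = 16L * 1024 * 1024;  // 16MB per memtable

        long worstCaseBytes = columnFamilies * memtablesPerColumnFamily * memtableSizeBytes;
        // 50 * 6 * 16MB = 4800MB: write buffers alone could in theory approach the whole
        // 4-5GB container limit, before counting block cache, index/filter blocks,
        // the JVM heap, or the page cache.
        System.out.printf("worst-case memtable memory: %d MB%n", worstCaseBytes / (1024 * 1024));
    }
}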

Nicolas: I hope this is somewhat understandable :sweat_smile: I feel like that’s more of a braindump - I think the profiling issue and issue 4002 provide more insight into how memory is managed and the downside of having multiple column families, and why it would lead to memory bloat over time even under low load (and even if the database on disk is actually small).

Lars Madsen: Yes, my knowledge of the internals is of course (and probably should be) limited. One thing I see is a large number of SST files - are those the files that have been flushed to disk, as you say? How do these get deleted? I see the runtime folder using over a gigabyte inside the container, all made up of (thousands of?) SST files.

Nicolas: SST files are immutable - however, as compaction occurs, some will become unused over time and will be removed. However, in 0.26 and prior, compaction is not configured and uses pure defaults, such that each level is quite big and compaction takes a while to remove smaller files.

Nicolas: The proposal in #4002 addresses this, and we now spent time understanding how it works to make sure we fix this. And you’re right, as a user I wouldn’t expect you to know this, but I’m also not sure how else to explain why we think properly configuring and using RocksDB is the key to solve the current issue, and why we think RocksDB is the reason for the current memory bloat issues :sweat_smile: