How do I tune Zeebe 0.25.1 to reduce memory usage?

Lars Madsen: I think I tried that and it failed

Lars Madsen: will look again

Lars Madsen: zeebe log on startup:

 "rocksdb" : {
      "columnFamilyOptions" : {
        "max_open_files" : "1024",
        "max_write_buffer_size_to_maintain" : "16000000",
        "write_buffer_size" : "8000000"
      }
java.lang.IllegalStateException: Expected to create column family options for RocksDB, but one or many values are undefined in the context of RocksDB [Compiled ColumnFamilyOptions: {compaction_pri=kOldestSmallestSeqFirst, max_open_files=1024, max_write_buffer_size_to_maintain=16000000, write_buffer_size=8000001}; User-provided ColumnFamilyOptions: {max_open_files=1024, max_write_buffer_size_to_maintain=16000000, write_buffer_size=8000001}]. See RocksDB's cf_options.h and options_helper.cc for available keys and values.
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.db.impl.rocksdb.ZeebeRocksDbFactory.createColumnFamilyOptions(ZeebeRocksDbFactory.java:129) ~[zeebe-db-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.db.impl.rocksdb.ZeebeRocksDbFactory.open(ZeebeRocksDbFactory.java:73) ~[zeebe-db-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.db.impl.rocksdb.ZeebeRocksDbFactory.createDb(ZeebeRocksDbFactory.java:58) ~[zeebe-db-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.db.impl.rocksdb.ZeebeRocksDbFactory.createDb(ZeebeRocksDbFactory.java:25) ~[zeebe-db-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.broker.system.partitions.impl.StateControllerImpl.openDb(StateControllerImpl.java:142) ~[zeebe-broker-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.broker.system.partitions.impl.StateControllerImpl.recover(StateControllerImpl.java:125) ~[zeebe-broker-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.broker.system.partitions.impl.steps.ZeebeDbPartitionStep.open(ZeebeDbPartitionStep.java:28) ~[zeebe-broker-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.broker.system.partitions.impl.PartitionTransitionImpl.installPartition(PartitionTransitionImpl.java:97) ~[zeebe-broker-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.broker.system.partitions.impl.PartitionTransitionImpl.lambda$installPartition$2(PartitionTransitionImpl.java:105) ~[zeebe-broker-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.util.sched.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:28) [zeebe-util-0.25.1.jar:0.25.1]
[zeebe-cluster-zeebe-1 zeebe-cluster] 	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:76) [zeebe-util-0.25.1.jar:0.25.1]

Lars Madsen: Ok - cracked it - the setting max_open_files must be omitted, as RocksDB does not recognize it as a column family option
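
For reference, the column family options end up looking something like this in the broker YAML (a rough sketch - the exact nesting under zeebe.broker.data isn't shown in the log above), with max_open_files left out:

    zeebe:
      broker:
        data:
          rocksdb:
            columnFamilyOptions:
              # per-column-family memtable (write buffer) size, in bytes
              write_buffer_size: 8000000
              # total memtable data (active + retained) to keep per column family, in bytes
              max_write_buffer_size_to_maintain: 16000000
              # max_open_files is not a column family option, so it must be omitted here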

Lars Madsen: up and running :slightly_smiling_face:

Nicolas: Yes, my fault - this is a setting we recommend, but it’s not settable on the column family options, I forgot about that :sweat_smile:

Nicolas: In the next release we will also expose the general RocksDB options (not CF specific), and there you’ll be able to set it. Sorry about that

Lars Madsen: No worries :slightly_smiling_face:

Andy Rusterholz: @Nicolas sorry to resurrect an old thread. We are about to apply the changes you suggested above (write_buffer_size of 8MB and max_write_buffer_size_to_maintain of 16MB). However, we are curious: how did you arrive at these values? And what other data points should we be looking at to tell us whether we need to keep adjusting them? Do these need to be adjusted based on resource availability, broker activity, number of partitions, etc.?

Nicolas: One of our devs played around with different numbers and found these to be acceptable. They should be adapted based on how much memory you have per broker and how many partitions a broker could be leader for (i.e. how many partitions live on that broker). In RocksDB, column families only share the WAL - everything else is separate. That means they each have their own memtables (the write buffers, which are all in memory), their own LSM tree (which needs some memory to keep track of and may incur extra memory usage during compaction and flushing), indexes, filters, cache, etc. You can configure RocksDB to amortize this cost across multiple column families, but we haven't done that yet, and I'm not sure you can do it via the properties.

So that said, up until 0.26.0 there are ~50 column families per partition. If you set max_write_buffer_size_to_maintain to 16MB, you could (worst case) use up to 16MB * 50 per partition, so you want to budget at least that much for your broker, plus enough JVM heap for it to run correctly (ballpark from experience, about 250MB per partition), plus some leftover memory for the OS page cache. The page cache is important for good read speed with mmap'd files (for the log storage readers) and also for RocksDB's own file readers (i.e. the compressed block cache). YMMV, but I would guess another 200MB per partition would be good - I recommend testing based on your workload; lower load probably means you can tolerate a higher churn rate.
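
As a rough worked example with the numbers above (estimates, not hard limits):

    per-partition budget ≈ (50 column families × 16MB memtables) + ~250MB JVM heap + ~200MB page cache ≈ 1.25GB
    per-broker budget    ≈ 1.25GB × (number of partitions the broker can be leader for)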

Nicolas: That said, we’re working on improving the situation such that it’s much easier to budget memory per partition. You can see a prototype here that had promising results: https://github.com/zeebe-io/zeebe/issues/4002

Nicolas: So the more memory you give RocksDB for the write buffers/memtables, the more data it can ingest, meaning the faster you can process things. There's a point of diminishing returns, though, where processing becomes faster than other parts of the system and those start being the bottleneck. So you might want to play with how much write buffer memory you give RocksDB until you have acceptable performance and memory usage.
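
For example, if you have the memory headroom, you could try doubling both values and re-measuring throughput and memory usage - hypothetical numbers, same assumed config location as in the sketch further up:

    zeebe:
      broker:
        data:
          rocksdb:
            columnFamilyOptions:
              write_buffer_size: 16000000                  # 16MB memtable per column family
              max_write_buffer_size_to_maintain: 32000000  # 32MB retained per column family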

Andy Rusterholz: Hi Nicolas. Thanks for this information. I think we’re still seeing abnormally-high memory usage. We are using three brokers and three partitions, and we just switched to using the 16MB/8MB config values above. If I understand your formula, that should mean that the maximum memory consumption for one partition should be roughly (250MB heap + 200MB cache + (16MB config value * 50 column families)) or 1250MB, and the maximum memory for one broker should be 3 times this, or 3.75GB. However we are still seeing these brokers consuming up to 4.9GB:

Andy Rusterholz: [attached: graph of broker memory usage]

Andy Rusterholz: (Note that the configuration values were applied midway through Fri 1/15.)

Andy Rusterholz: So there seems to be over 1GB of unexplained memory consumption by the brokers. This is using Zeebe 0.25.2. Should we file an issue?

Nicolas: There’s some memory overhead per column family that isn’t accounted for by the memtables - for example, a constant memory overhead when flushing a memtable, some overhead during compaction, some overhead for the LSM tree itself, etc. This is why we want to trim this down and make it simpler in the next version. You can add your info to https://github.com/zeebe-io/zeebe/issues/5736, I don’t think it’s worth opening a new issue.

As I mentioned, we already have a proposal, but it’s a substantial change which I would normally not be keen to package as a patch due to how different it is - see https://github.com/zeebe-io/zeebe/issues/4002.

Lars Madsen: The problem we saw in earlier versions is that at around 4 GB of memory we started getting errors when starting new workflows. Now I typically restart when I spot errors in the Zeebe broker, to avoid client errors. Is there a known limit of memory consumption where the brokers become unstable?

Nicolas: How much memory are you provisioning?

Nicolas: Errors related to memory usage would be relative to how much memory is available (to some extent - high memory usage might also indicate that your system is under heavy load, which would slow some things down, but I wouldn’t expect actual errors from that, mostly warnings that things are timing out and so on).