Performance profiling tool

I’ve updated my ghetto TPS tool to enable profiling Zeebe releases. You can check it out here: https://github.com/jwulf/zeebe-garage-benchmark.

It uses Docker to start brokers of various versions, then creates workflows as fast as possible. You can run it with backpressure enabled or disabled, and specify the number of partitions.

These numbers are only an example. Don’t rely on them; run the test yourself, over a long period of time.

Here is an example of it running on my machine with backpressure disabled, with 3 partitions:

➜ t test.ts -z 0.22.5,0.23.5,0.24.1,0.25.0-alpha2 -t 30 -d -p 3

Version: 0.22.5 | Time: 30s
Starting Zeebe 0.22.5 with 3 partitions | Backpressure disabled...
Started Zeebe broker
Workflow deployed: 2251799813685250
Time :   Total   | wf/s  | running average
5s :     145     | 29    | 29/sec
10s :    345     | 40    | 35/sec
15s :    609     | 53    | 41/sec
20s :    915     | 61    | 46/sec
25s :    1264    | 70    | 51/sec
Average TPS: 51/sec.

Version: 0.23.5 | Time: 30s
Starting Zeebe 0.23.5 with 3 partitions | Backpressure disabled...
Started Zeebe broker
Workflow deployed: 2251799813685250
Time :   Total   | wf/s  | running average
5s :     110     | 22    | 22/sec
10s :    227     | 23    | 23/sec
15s :    350     | 25    | 23/sec
20s :    475     | 25    | 24/sec
25s :    613     | 28    | 25/sec
Average TPS: 25/sec.

Version: 0.24.1 | Time: 30s
Starting Zeebe 0.24.1 with 3 partitions | Backpressure disabled...
Started Zeebe broker
Workflow deployed: 2251799813685250
Time :   Total   | wf/s  | running average
5s :     47      | 9     | 9/sec
10s :    136     | 18    | 14/sec
15s :    252     | 23    | 17/sec
Average TPS: 17/sec.

Version: 0.25.0-alpha2 | Time: 30s
Starting Zeebe 0.25.0-alpha2 with 3 partitions | Backpressure disabled...
Started Zeebe broker
Workflow deployed: 2251799813685250
Time :   Total   | wf/s  | running average
5s :     184     | 37    | 37/sec
10s :    384     | 40    | 38/sec
15s :    599     | 43    | 40/sec
20s :    821     | 44    | 41/sec
25s :    1045    | 45    | 42/sec
Average TPS: 42/sec.
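
For anyone reading the columns: judging only from the printed numbers (not from the tool’s source), “wf/s” looks like the rate over the last 5-second interval and “running average” looks like the cumulative total divided by elapsed time, both rounded. A minimal sketch that reproduces the 0.22.5 rows above under that assumption (the names `Sample`, `Row` and `summarize` are made up for illustration):

```typescript
// One printed sample: cumulative workflows started after `elapsedSec` seconds.
interface Sample {
  elapsedSec: number;
  total: number;
}

interface Row extends Sample {
  intervalRate: number;   // assumed meaning of the "wf/s" column
  runningAverage: number; // assumed meaning of the "running average" column
}

// Derive both rate columns from the cumulative totals.
function summarize(samples: Sample[]): Row[] {
  let prev: Sample = { elapsedSec: 0, total: 0 };
  return samples.map((s) => {
    const intervalRate = Math.round(
      (s.total - prev.total) / (s.elapsedSec - prev.elapsedSec)
    );
    const runningAverage = Math.round(s.total / s.elapsedSec);
    prev = s;
    return { ...s, intervalRate, runningAverage };
  });
}

// Totals copied from the 0.22.5 run with 3 partitions.
const run0225: Sample[] = [
  { elapsedSec: 5, total: 145 },
  { elapsedSec: 10, total: 345 },
  { elapsedSec: 15, total: 609 },
  { elapsedSec: 20, total: 915 },
  { elapsedSec: 25, total: 1264 },
];

const rows: Row[] = summarize(run0225);
```

Under that reading, the final “Average TPS” is simply the last running average, which is why short runs that are still ramping up understate the steady-state rate.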

Same test with 2 partitions:

Version: 0.22.5 | Time: 30s
Starting Zeebe 0.22.5 with 2 partitions | Backpressure disabled...
Started Zeebe broker
Time :   Total   | wf/s  | running average
5s :     288     | 58    | 58/sec
10s :    640     | 70    | 64/sec
15s :    1007    | 73    | 67/sec
20s :    1385    | 76    | 69/sec
25s :    1731    | 69    | 69/sec
Average TPS: 69/sec.

Version: 0.23.5 | Time: 30s
Starting Zeebe 0.23.5 with 2 partitions | Backpressure disabled...
Started Zeebe broker
Time :   Total   | wf/s  | running average
5s :     76      | 15    | 15/sec
10s :    146     | 14    | 15/sec
15s :    216     | 14    | 14/sec
20s :    289     | 15    | 14/sec
25s :    359     | 14    | 14/sec
Average TPS: 14/sec.

Version: 0.24.1 | Time: 30s
Starting Zeebe 0.24.1 with 2 partitions | Backpressure disabled...
Started Zeebe broker
Time :   Total   | wf/s  | running average
5s :     55      | 11    | 11/sec
10s :    144     | 18    | 14/sec
15s :    255     | 22    | 17/sec
20s :    381     | 25    | 19/sec
25s :    530     | 30    | 21/sec
Average TPS: 21/sec.

Version: 0.25.0-alpha2 | Time: 30s
Starting Zeebe 0.25.0-alpha2 with 2 partitions | Backpressure disabled...
Started Zeebe broker
Time :   Total   | wf/s  | running average
5s :     145     | 29    | 29/sec
10s :    311     | 33    | 31/sec
15s :    484     | 35    | 32/sec
20s :    658     | 35    | 33/sec
25s :    837     | 36    | 33/sec
Average TPS: 33/sec.

Here it is running the same test, with one partition:

➜ t test.ts -z 0.22.5,0.23.5,0.24.1,0.25.0-alpha2 -t 30 -d -p 1

Version: 0.22.5 | Time: 30s
Starting Zeebe 0.22.5 with 1 partitions | Backpressure disabled...
Started Zeebe broker
Time :   Total   | wf/s  | running average
5s :     168     | 34    | 34/sec
10s :    400     | 46    | 40/sec
15s :    666     | 53    | 44/sec
20s :    963     | 59    | 48/sec
25s :    1307    | 69    | 52/sec
Average TPS: 52/sec.

Version: 0.23.5 | Time: 30s
Starting Zeebe 0.23.5 with 1 partitions | Backpressure disabled...
Started Zeebe broker
Time :   Total   | wf/s  | running average
5s :     149     | 30    | 30/sec
10s :    309     | 32    | 31/sec
15s :    492     | 37    | 33/sec
20s :    663     | 34    | 33/sec
25s :    835     | 34    | 33/sec
Average TPS: 33/sec.

Version: 0.24.1 | Time: 30s
Starting Zeebe 0.24.1 with 1 partitions | Backpressure disabled...
Started Zeebe broker
Time :   Total   | wf/s  | running average
5s :     51      | 10    | 10/sec
10s :    108     | 11    | 11/sec
15s :    173     | 13    | 12/sec
20s :    252     | 16    | 13/sec
25s :    326     | 15    | 13/sec
Average TPS: 13/sec.

Version: 0.25.0-alpha2 | Time: 30s
Starting Zeebe 0.25.0-alpha2 with 1 partitions | Backpressure disabled...
Started Zeebe broker
Time :   Total   | wf/s  | running average
5s :     151     | 30    | 30/sec
10s :    367     | 43    | 37/sec
15s :    627     | 52    | 42/sec
20s :    931     | 61    | 47/sec
25s :    1268    | 67    | 51/sec
Average TPS: 51/sec.

Here are the aggregated results. I’ve put an asterisk next to the fastest partition configuration for each version.

Again, don’t refer to these results as the official performance. Run the test yourself:

Version          Partitions     Average TPS
0.22.5           1              52
                 2*             69
                 3              51
0.23.5           1*             33
                 2              14
                 3              25
0.24.1           1              13
                 2*             21
                 3              17
0.25.0-alpha2    1*             51
                 2              33
                 3              42
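
If you collect your own numbers, picking the fastest configuration per version is easy to automate. Here is a small sketch using the figures from the table above; the `Result` shape and the `fastestPerVersion` helper are hypothetical and not part of the benchmark tool:

```typescript
interface Result {
  version: string;
  partitions: number;
  averageTps: number;
}

// Figures copied from the aggregated table (30-second runs, backpressure disabled).
const results: Result[] = [
  { version: "0.22.5", partitions: 1, averageTps: 52 },
  { version: "0.22.5", partitions: 2, averageTps: 69 },
  { version: "0.22.5", partitions: 3, averageTps: 51 },
  { version: "0.23.5", partitions: 1, averageTps: 33 },
  { version: "0.23.5", partitions: 2, averageTps: 14 },
  { version: "0.23.5", partitions: 3, averageTps: 25 },
  { version: "0.24.1", partitions: 1, averageTps: 13 },
  { version: "0.24.1", partitions: 2, averageTps: 21 },
  { version: "0.24.1", partitions: 3, averageTps: 17 },
  { version: "0.25.0-alpha2", partitions: 1, averageTps: 51 },
  { version: "0.25.0-alpha2", partitions: 2, averageTps: 33 },
  { version: "0.25.0-alpha2", partitions: 3, averageTps: 42 },
];

// Keep, for each version, the result with the highest average TPS.
function fastestPerVersion(rs: Result[]): Record<string, Result> {
  const best: Record<string, Result> = {};
  for (const r of rs) {
    const current: Result | undefined = best[r.version];
    if (current === undefined || r.averageTps > current.averageTps) {
      best[r.version] = r;
    }
  }
  return best;
}

const best = fastestPerVersion(results);
```

With these inputs the helper selects the same rows that carry an asterisk in the table.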

You really need to run them longer to see how each performs over time, with garbage collection etc…

But this is a good tool for examining the relative performance of releases, and the impact of partitioning.


Great work, @jwulf. Very useful.

Some ideas for your tests:

  1. Count not only how many workflows you can start, but also how many workflows complete per second.
  2. Measure how long it takes to complete a single “nothing” workflow.
  3. Measure the time between the creation of a workflow instance and the moment its “nothing” task starts executing.
  4. Run several Zeebe brokers, not only several partitions. E.g. 3 brokers, 3 partitions, replication factor 3.
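
Ideas 1 to 3 boil down to simple arithmetic once you have three timestamps per workflow instance. The sketch below assumes you have already recorded them somehow (for example in the client and in the worker); the `InstanceTiming` shape, the helper names and the sample values are all invented for illustration and are not taken from the benchmark tool. Idea 4 is a deployment question and is not covered here.

```typescript
// Hypothetical per-instance record; all timestamps are in milliseconds.
interface InstanceTiming {
  createdAt: number;       // create request sent
  taskActivatedAt: number; // worker received the "nothing" task
  completedAt: number;     // workflow instance finished
}

interface Summary {
  completedPerSec: number;       // idea 1
  meanCompletionMs: number;      // idea 2
  meanTimeToTaskStartMs: number; // idea 3
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// `timings` holds only completed instances; `durationSec` is the test length.
function summarizeTimings(timings: InstanceTiming[], durationSec: number): Summary {
  return {
    completedPerSec: timings.length / durationSec,
    meanCompletionMs: mean(timings.map((t) => t.completedAt - t.createdAt)),
    meanTimeToTaskStartMs: mean(timings.map((t) => t.taskActivatedAt - t.createdAt)),
  };
}

// Made-up sample data, not measured values.
const sampleTimings: InstanceTiming[] = [
  { createdAt: 0, taskActivatedAt: 20, completedAt: 50 },
  { createdAt: 100, taskActivatedAt: 140, completedAt: 200 },
  { createdAt: 200, taskActivatedAt: 230, completedAt: 290 },
];

const summary: Summary = summarizeTimings(sampleTimings, 1);
```

For long runs, percentiles are usually more informative than the mean, since occasional long pauses disappear in an average.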

These metrics may reveal surprisingly poor performance from Zeebe relative to the available hardware.

You are right about running these tests for long periods, like months. What I learned from similar tests is that even when everything is perfect, Zeebe “goes into itself” several times a day, which causes significant delays in processing workflows. But the perfect state is only temporary. The normal state is that one broker or one partition is lost, with significant delays in execution all the time.

Thanks @SergeyL!

Have a look at the TODO list in the README file. These are on the road map. Great minds think alike. :smile:

I used this stress test against Zeebe:
#!/bin/bash
docker pull williamyeh/wrk
myip=$(hostname -I | cut -d' ' -f1)

# run 100 connections across 10 threads for 10 seconds (spam start process)
docker run --rm -v "$(pwd)":/data williamyeh/wrk -s post.lua -t10 -c100 -d10s http://$myip:2700/zeebe/process/payment-retrieval/start

post.lua:
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"

with a result like this:
10 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 520.54ms 217.71ms 1.59s 79.36%
Req/Sec 20.51 12.67 80.00 78.47%
1763 requests in 10.05s, 581.93KB read
Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec: 175.45
Transfer/sec: 57.91KB

root@zeebe:/usr/docker/bpmn-gateway/test# sar 5 5
Linux 5.4.0-51-generic (zeebe) 10/31/2020 x86_64 (4 CPU)

09:46:29 PM CPU %user %nice %system %iowait %steal %idle
09:46:34 PM all 78.39 0.00 20.51 0.00 0.00 1.10
09:46:39 PM all 79.06 0.00 19.48 0.05 0.00 1.41
09:46:44 PM all 76.41 0.00 19.98 0.15 0.00 3.46
09:46:49 PM all 60.63 0.00 15.22 0.00 0.00 24.15
09:46:54 PM all 27.45 0.00 13.50 0.05 0.00 58.99
Average: all 65.02 0.00 17.82 0.05 0.00 17.10

Then I looked at the logs to see how many errors and timeouts occurred and how much time was needed to finish all the workflows.

Actually, it is possible to calculate everything without studying logs, by using the Elasticsearch indexes or by implementing some feedback from the workers.
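
As a rough sanity check on the wrk figures quoted above, Little’s law (average requests in flight = throughput × average latency) can be applied to the reported numbers. This is only a back-of-the-envelope sketch; the variable names are made up, and the values are copied from the wrk output:

```typescript
// Values copied from the wrk output above.
const requests = 1763;
const durationSec = 10.05;
const reportedRps = 175.45;
const avgLatencySec = 0.52054; // 520.54ms
const connections = 100;

// Throughput recomputed from the request count and the rounded duration.
const recomputedRps = requests / durationSec;

// Little's law: average number of requests in flight.
const inFlight = reportedRps * avgLatencySec;

// Share of the 100 open connections that were busy on average.
const connectionUtilization = inFlight / connections;
```

The result comes out at roughly 91 requests in flight for 100 connections, so the connections were busy most of the time. That suggests the throughput in this test was limited by how quickly each request was answered rather than by wrk itself.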


Lua :ok_hand: