Not all the workflow instances are exported

Hi guys,

We installed Zeebe in k8s with the Hazelcast exporter and Simple Monitor, and we noticed that not all workflow instances are exported.

This seems to be related to Simple Monitor not being able to communicate with all the brokers?

The pattern we noticed is the following:

  • If we have 2 brokers and 3 partitions, we can see instances from only two partitions
  • If we have 3 brokers and 3 partitions, we can see instances from only one partition
  • If we have 2 brokers and 1 partition, we see all the instances (since there is only one partition)

What we believe is happening is that Simple Monitor talks to only one broker to fetch the exported data, and thus can only see the partitions hosted on that broker.

We added the Hazelcast exporter to Zeebe, exposed port 5701 in the StatefulSet, and added a Service for that port.
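
For reference, the Service we added looks roughly like this (the names and labels are just from our setup, adjust them to your own):

    apiVersion: v1
    kind: Service
    metadata:
      name: zeebe-hazelcast          # name from our setup, not prescribed by Zeebe
    spec:
      clusterIP: None                # headless, like the broker StatefulSet
      selector:
        app: zeebe-broker            # must match the labels of the broker pods
      ports:
        - name: hazelcast
          port: 5701
          targetPort: 5701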

Maybe the gateway aggregates the data from all the brokers, and we should go through it instead of addressing the Zeebe brokers directly via a k8s Service? If that is the case, is there a way to do that, i.e. to have the gateway aggregate the data exported to Hazelcast?

Thanks for your help.
Lucas L.-

Hi @lucasledesma :wave:

If you have a cluster of multiple brokers, then I recommend using an external Hazelcast instance/cluster and configuring the exporter to insert the records into this external instance.

In the configuration of the exporter, you need to set the argument remoteAddress so that it points to the external Hazelcast. Please have a look at GitHub - camunda-community-hub/zeebe-hazelcast-exporter: Export events from Zeebe to Hazelcast.
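
As a rough sketch, the broker configuration (application.yaml in Zeebe 0.23+) could look like the snippet below. The class name and jar path are assumptions based on the exporter README, so please double-check them for your version; remoteAddress is the relevant argument:

    zeebe:
      broker:
        exporters:
          hazelcast:
            className: io.zeebe.hazelcast.exporter.HazelcastExporter                # check the README for the exact class name
            jarPath: exporters/zeebe-hazelcast-exporter-jar-with-dependencies.jar   # assumed location inside the broker image
            args:
              remoteAddress: my-external-hazelcast:5701                             # hypothetical host, point it to your external Hazelcast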

Does this help you?

Best regards,
Philipp

1 Like

@philipp.ossler , yes this helps. Thanks!
We saw that and we will try with an external hazelcast instance.

Also, we saw in Simple Monitor that when there are more workflow instances than fit on one page and you press “next”, Simple Monitor says that pagination is not yet implemented. We are using simple-monitor version 0.18.0; do you know if this is already available in newer versions?
Thanks,
Lucas L.-

I’m not aware of this bug :sweat_smile: Please create an issue :slight_smile:

@philipp.ossler

Alright, this is the issue we created.

Thanks!
Lucas L.-

1 Like

@philipp.ossler
BTW, we were able to get data from all the brokers using an external Hazelcast cluster, as you suggested.
Thanks!

1 Like

We closed the pagination issue since we found it was related to the Hibernate dialect.

We changed spring.jpa.properties.hibernate.dialect from

org.hibernate.dialect.SQLServerDialect

to

org.hibernate.dialect.SQLServer2008Dialect

and it worked!!
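
For anyone else hitting this, here is roughly how the override looks in YAML form (this is just how we set it via the Spring configuration of Simple Monitor; an equivalent environment variable should work too):

    spring:
      jpa:
        properties:
          hibernate:
            dialect: org.hibernate.dialect.SQLServer2008Dialect   # was org.hibernate.dialect.SQLServerDialect before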

2 Likes

Hi @philipp.ossler,
I wanted to know if you have seen this behaviour before.

We deployed Zeebe with an external Hazelcast cluster (as suggested), together with ZeeQS and Simple Monitor, in k8s.

When we start workflows one by one, we can see them in Simple Monitor and ZeeQS without any problem.

But when we try to start many workflows in parallel (around 100), we can see that Zeebe completes the workflows and ZeeQS counts 100 workflow instances, but Simple Monitor seems to hang somehow: it sees the first 12-20, but then it stops updating its DB and cannot recover from there anymore.

Any ideas what could be happening?

We don’t see any errors in the logs; Simple Monitor simply seems to stop receiving messages.

We tried the following versions:

  • Zeebe 0.23.4
  • Hazelcast exporter 0.9.0
  • Hazelcast 4.0.1 (Helm chart 3.4.5)
  • Simple Monitor 0.19.0

Thanks for your help,
Lucas L.-

No, this does not sound familiar to me. Without any errors in the log, it is hard to say what is going wrong.

Can you reproduce this behavior? Or did it happen just once?

It is reproducible. It happens every time we increase the number of workflows we run in parallel in Zeebe. It seems that Simple Monitor gets stuck (or becomes really slow) at some point when there is too much data to fetch from Hazelcast/Zeebe.
We noticed that this doesn’t seem to happen when using the in-memory DB, and Simple Monitor is much faster. So it might be related to the performance hit of having to update the DB (Postgres) and things piling up?
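
For context, our Simple Monitor is pointed at Postgres via the standard Spring datasource properties, roughly like this (hostnames and credentials below are placeholders):

    spring:
      datasource:
        url: jdbc:postgresql://my-postgres:5432/simple_monitor   # placeholder host and database name
        username: simple_monitor                                  # placeholder credentials
        password: changeme
        driver-class-name: org.postgresql.Driver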

We also got these logs from Simple Monitor:

java.lang.IllegalArgumentException: Invalid character found in method name [0xc10xbe0xd7A0x100x91]0x17bg]. HTTP method names must be tokens
        at org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:418) ~[tomcat-embed-core-9.0.35.jar:9.0.35]
        at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:260) ~[tomcat-embed-core-9.0.35.jar:9.0.35]
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) ~[tomcat-embed-core-9.0.35.jar:9.0.35]
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:868) ~[tomcat-embed-core-9.0.35.jar:9.0.35]
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1590) ~[tomcat-embed-core-9.0.35.jar:9.0.35]
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) ~[tomcat-embed-core-9.0.35.jar:9.0.35]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0.35.jar:9.0.35]
        at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]

2020-08-04 12:59:21.279  INFO 1 --- [nio-8082-exec-6] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 69900 ms. Closing WebSocketServerSockJsSession[id=4ik0xwcf].
2020-08-04 12:59:21.281  INFO 1 --- [nio-8082-exec-6] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 81988 ms. Closing WebSocketServerSockJsSession[id=f3ce34uo].
2020-08-04 12:59:21.281  INFO 1 --- [nio-8082-exec-6] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 104994 ms. Closing WebSocketServerSockJsSession[id=i10y2mej].
2020-08-04 12:59:21.281  INFO 1 --- [nio-8082-exec-6] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 102799 ms. Closing WebSocketServerSockJsSession[id=tbhguzvf].
2020-08-04 12:59:21.282  INFO 1 --- [nio-8082-exec-6] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 96084 ms. Closing WebSocketServerSockJsSession[id=ys44spjs].
2020-08-04 12:59:21.282  INFO 1 --- [nio-8082-exec-6] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 67713 ms. Closing WebSocketServerSockJsSession[id=5n0ku2bo].
2020-08-04 12:59:21.282  INFO 1 --- [nio-8082-exec-6] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 89065 ms. Closing WebSocketServerSockJsSession[id=vqzlxbvv].
2020-08-04 12:59:21.283  INFO 1 --- [nio-8082-exec-6] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 79823 ms. Closing WebSocketServerSockJsSession[id=jtedgd0y].
2020-08-04 12:59:21.283  INFO 1 --- [nio-8082-exec-6] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 100169 ms. Closing WebSocketServerSockJsSession[id=hduvgu4r].
2020-08-04 13:04:59.333  INFO 1 --- [nio-8082-exec-5] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 335028 ms. Closing WebSocketServerSockJsSession[id=x1zvo52e].
2020-08-04 13:04:59.336  INFO 1 --- [nio-8082-exec-5] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 365036 ms. Closing WebSocketServerSockJsSession[id=wabwrpmc].
2020-08-04 13:04:59.337  INFO 1 --- [nio-8082-exec-5] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 362268 ms. Closing WebSocketServerSockJsSession[id=fk2wdcxf].
2020-08-04 13:04:59.337  INFO 1 --- [nio-8082-exec-5] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 345957 ms. Closing WebSocketServerSockJsSession[id=bbqasyyr].
2020-08-04 13:04:59.338  INFO 1 --- [nio-8082-exec-5] o.s.w.s.m.SubProtocolWebSocketHandler    : No messages received after 355320 ms. Closing WebSocketServerSockJsSession[id=vwpsgejp].



2020-08-04 13:13:48.167  INFO 1 --- [MessageBroker-2] o.s.w.s.c.WebSocketMessageBrokerStats    : WebSocketSession[2 current WS(1)-HttpStream(1)-HttpPoll(0), 42 total, 28 closed abnormally (28 connect failure, 0 send limit, 2 transport error)], stompSubProtocol[processed CONNECT(11)-CONNECTED(11)-DISCONNECT(10)], stompBrokerRelay[null], inboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 222], outboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 2442], sockJsScheduler[pool size = 2, active threads = 1, queued tasks = 3, completed tasks = 497]

Not sure if those help.

Thanks,
Lucas L.-

We created this issue: https://github.com/zeebe-io/zeebe-simple-monitor/issues/168

We believe it can be related to a race condition. We did the same test using Operate, and what we see is that under heavy load the task handlers start to take longer than the timeout configured for those tasks, and incidents start to appear for that reason.

When this is the case, Simple Monitor hangs, but Operate does not. And Simple Monitor cannot recover from that point on, even if we delete the monitor pod, or even the monitor pod plus the Hazelcast pods and volumes.

In one of the tests, we were able to make Simple Monitor recover by deleting the pod and deleting the entry in the hazelcast_config table. But this is not always enough: in that case we saw incidents arriving in the incidents table, but then we could not resolve them because they pointed to non-existent workflow instances. We believe some information might have been lost in the middle.

In Operate you can see the incidents piling up, but you can retry them later and they go through.

Hope this information helps. We can try to get more info if that helps you.

Best regards,
Lucas L.-

Thank you for sharing!
I’ll try to reproduce it when I’ve time :slight_smile:

2 Likes