Message Subscription event goes missing intermittently

We are experiencing an issue where the message subscription event goes missing intermittently and the workflow gets stuck in processing.
Scenario:
Task A (Task) (publishes message event) → MessageEvent (Task) → Notification (Task)
Task A completes processing and shows the status COMPLETED in the Elasticsearch export event.
But at times the MessageEvent subscription for the same correlation key does not get created and the workflow gets stuck.
As per the Zeebe docs, the subscription is opened when the flow enters the message event:

A subscription is opened when a workflow instance awaits a message, for example, when entering a message catch event.

But at times this does not appear in the exported event logs (around 10% of the time). Any heads-up on this?

@Sachin, thank you for raising this.

Please provide more information to reproduce and analyse the issue:

  • BPMN workflow
  • commands that are executed
  • Zeebe version + client

My first idea is that the message catch event is not entered (i.e. the message subscription is not opened) when the message is published. To ensure that the message is correlated, the message should be published with a time-to-live > 0. See more in the documentation: https://docs.zeebe.io/reference/message-correlation/message-correlation.html#message-buffering
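
For example, a minimal sketch of publishing a message with a non-zero time-to-live using the Zeebe Java client. The gateway address, message name, correlation key, and TTL below are placeholders, not values from your setup:

    import java.time.Duration;
    import io.zeebe.client.ZeebeClient;

    public class PublishWithTtl {
        public static void main(String[] args) {
            // Placeholder gateway address; adjust to your environment.
            try (ZeebeClient client = ZeebeClient.newClientBuilder()
                    .brokerContactPoint("localhost:26500")
                    .build()) {
                client.newPublishMessageCommand()
                    .messageName("myMessage")            // placeholder message name
                    .correlationKey("order-123")         // placeholder correlation key
                    .timeToLive(Duration.ofMinutes(5))   // TTL > 0: the message is buffered for this duration
                    .send()
                    .join();
            }
        }
    }

With a TTL of zero, a message that arrives before the catch event is entered is dropped immediately; with a positive TTL it is buffered and can still be correlated when the subscription opens within that time.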

Best regards,
Philipp


@philipp.ossler, thank you for the prompt response.

The workflow is as follows:

Commands executed, in order:
Task A - after the task logic: client.newCompleteCommand(job.getKey()).variables(payload).send().join(); (see the worker sketch below)
The workflow then enters the Intermediate Catch Event, but at times the subscription is not created (intermittent failure).

Task A - (exposed API)
this.zeebeClient.newPublishMessageCommand().messageName(messageName).correlationKey(corelationKey).timeToLive(Duration.ofMinutes((long)timeToLive)).variables(variables).send().join();
The time-to-live is 1 minute.
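
For completeness, here is a rough sketch of how the complete command above is issued from our job worker. The job type "taskA", the gateway address, and the payload are simplified placeholders, not our actual values:

    import java.util.Collections;
    import io.zeebe.client.ZeebeClient;

    public class TaskAWorker {
        public static void main(String[] args) throws Exception {
            try (ZeebeClient client = ZeebeClient.newClientBuilder()
                    .brokerContactPoint("localhost:26500")   // placeholder gateway address
                    .build()) {
                // "taskA" is a placeholder job type; it must match the service task type in the BPMN.
                client.newWorker()
                    .jobType("taskA")
                    .handler((jobClient, job) -> {
                        // ... task logic ...
                        jobClient.newCompleteCommand(job.getKey())
                            .variables(Collections.singletonMap("result", "ok"))  // placeholder payload
                            .send()
                            .join();
                    })
                    .open();

                // Keep the worker alive (sketch only).
                Thread.sleep(Long.MAX_VALUE);
            }
        }
    }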

New finding today:
The message subscription did get created, but only 2 hours after Task A completed; I can now see the subscription event, with a timestamp two hours later.
I have checked a few more cases where the message subscription is getting delayed by 20-60 minutes. Is this expected? At times we have also seen this exception on the broker:

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled

We have also observed that some message subscriptions were generated only after a broker restart, which was done 4 days after the workflow was submitted.

Zeebe version
zeebe-util-0.22.0-alpha1.jar:0.22.0-alpha1

Exported events
Message PUBLISHED event (this happens after Task A has completed; some user activity is involved that triggers the message publish, usually about 5 minutes after Task A completes):
{
  "_index" : "zeebe-record-message_2020-01-13",
  "_type" : "_doc",
  "_id" : "2-16561397058096",
  "_score" : 2.3025851,
  "_source" : {
    "partitionId" : 2,
    "value" : {
      "timeToLive" : 60000,
      "variables" : { },
      "correlationKey" : "6DBC1F29EA6443FB8CC5EF641B7467DD",
      "messageId" : "",
      "name" : "CMTK_Validation_Message"
    },
    "sourceRecordPosition" : 16561397057368,
    "position" : 16561397058096,
    "timestamp" : 1578906438486,
    "valueType" : "MESSAGE",
    "intent" : "PUBLISHED",
    "recordType" : "EVENT",
    "rejectionType" : "NULL_VAL",
    "rejectionReason" : "",
    "key" : 4503599650233676
  }
}

Message DELETED event (after almost 1 minute, as per the TTL):
{
  "_index" : "zeebe-record-message_2020-01-13",
  "_type" : "_doc",
  "_id" : "2-16561397119200",
  "_score" : 2.3025851,
  "_source" : {
    "partitionId" : 2,
    "value" : {
      "timeToLive" : 60000,
      "variables" : { },
      "correlationKey" : "6DBC1F29EA6443FB8CC5EF641B7467DD",
      "messageId" : "",
      "name" : "CMTK_Validation_Message"
    },
    "sourceRecordPosition" : 16561397118456,
    "position" : 16561397119200,
    "timestamp" : 1578906516626,
    "valueType" : "MESSAGE",
    "intent" : "DELETED",
    "recordType" : "EVENT",
    "rejectionType" : "NULL_VAL",
    "rejectionReason" : "",
    "key" : 4503599650233676
  }
}

@Sachin, thank you for the insights.

This kind of delay is not expected.

Any idea how to reproduce this issue?
Do you see any other errors in the log file?
Do you observe any other unusual effect?

What is the workload?
How many partitions and brokers are in the cluster?