Message Subscription event goes missing intermittently

We are experiencing an issue where the message subscription event goes missing intermittently and the workflow gets stuck in processing.
Scenario:
Task A (Task) (publishes message event) → MessageEvent (Task) → Notification (Task)
Task A completes processing and shows the status COMPLETED in the Elasticsearch export event.
But at times the MessageEvent subscription for the same correlation key does not get created and the workflow gets stuck.
As per the Zeebe docs, the subscription is opened when the flow enters the message event:

A subscription is opened when a workflow instance awaits a message, for example, when entering a message catch event.

But at times this does not appear in the exported event logs (around 10% of the time). Any heads-up on this?

@Sachin, thank you for raising this.

Please provide more information to reproduce and analyse the issue:

  • BPMN workflow
  • commands that are executed
  • Zeebe version + client

My first idea is that the message catch event is not entered (i.e. the message subscription is not opened) when the message is published. To ensure that the message is correlated, the message should be published with a time-to-live > 0. See more in the documentation: https://docs.zeebe.io/reference/message-correlation/message-correlation.html#message-buffering
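
For example, a minimal sketch of publishing a message with a non-zero time-to-live using the Zeebe Java client. The gateway address, message name, correlation key, and TTL below are placeholders, not values from your setup:

    import java.time.Duration;
    import io.zeebe.client.ZeebeClient;

    public class PublishWithTtl {
        public static void main(String[] args) {
            // Placeholder gateway address; adjust to your environment.
            try (ZeebeClient client = ZeebeClient.newClientBuilder()
                    .brokerContactPoint("localhost:26500")
                    .build()) {
                client.newPublishMessageCommand()
                    .messageName("myMessage")            // placeholder message name
                    .correlationKey("order-123")         // placeholder correlation key
                    .timeToLive(Duration.ofMinutes(5))   // TTL > 0: the message is buffered for this duration
                    .send()
                    .join();
            }
        }
    }

With a TTL of zero, a message that arrives before the catch event is entered is dropped immediately; with a positive TTL it is buffered and can still be correlated when the subscription opens within that time.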

Best regards,
Philipp


@philipp.ossler, thank you for the prompt response.

The workflow is as follows:

Commands executed, in order:
Task A - after the task logic: client.newCompleteCommand(job.getKey()).variables(payload).send().join(); (see the worker sketch below)
The workflow then enters the Intermediate Catch Event, but at times the subscription is not created (intermittent failure).

Task A - (exposed API)
this.zeebeClient.newPublishMessageCommand().messageName(messageName).correlationKey(corelationKey).timeToLive(Duration.ofMinutes((long)timeToLive)).variables(variables).send().join();
The time-to-live is 1 minute.
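
For completeness, here is a rough sketch of how the complete command above is issued from our job worker. The job type "taskA", the gateway address, and the payload are simplified placeholders, not our actual values:

    import java.util.Collections;
    import io.zeebe.client.ZeebeClient;

    public class TaskAWorker {
        public static void main(String[] args) throws Exception {
            try (ZeebeClient client = ZeebeClient.newClientBuilder()
                    .brokerContactPoint("localhost:26500")   // placeholder gateway address
                    .build()) {
                // "taskA" is a placeholder job type; it must match the service task type in the BPMN.
                client.newWorker()
                    .jobType("taskA")
                    .handler((jobClient, job) -> {
                        // ... task logic ...
                        jobClient.newCompleteCommand(job.getKey())
                            .variables(Collections.singletonMap("result", "ok"))  // placeholder payload
                            .send()
                            .join();
                    })
                    .open();

                // Keep the worker alive (sketch only).
                Thread.sleep(Long.MAX_VALUE);
            }
        }
    }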

New finding today:
The message subscription did get created, but only 2 hours after Task A completed; I can now see the subscription event, with a timestamp two hours later.
I have checked a few more cases where the message subscription is getting delayed by 20-60 minutes. Is this expected? At times we have also seen this exception on the broker:

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled

We have also observed that some message subscriptions were generated only after a broker restart, which was done 4 days after the workflow was submitted.

Zeebe version
zeebe-util-0.22.0-alpha1.jar:0.22.0-alpha1

Exported events
Message PUBLISHED event (this happens after Task A has completed; some user activity is involved that triggers the message publish, usually about 5 minutes after Task A completes):
{
  "_index" : "zeebe-record-message_2020-01-13",
  "_type" : "_doc",
  "_id" : "2-16561397058096",
  "_score" : 2.3025851,
  "_source" : {
    "partitionId" : 2,
    "value" : {
      "timeToLive" : 60000,
      "variables" : { },
      "correlationKey" : "6DBC1F29EA6443FB8CC5EF641B7467DD",
      "messageId" : "",
      "name" : "CMTK_Validation_Message"
    },
    "sourceRecordPosition" : 16561397057368,
    "position" : 16561397058096,
    "timestamp" : 1578906438486,
    "valueType" : "MESSAGE",
    "intent" : "PUBLISHED",
    "recordType" : "EVENT",
    "rejectionType" : "NULL_VAL",
    "rejectionReason" : "",
    "key" : 4503599650233676
  }
}

Message DELETED event (after almost 1 minute, as per the TTL):
{
  "_index" : "zeebe-record-message_2020-01-13",
  "_type" : "_doc",
  "_id" : "2-16561397119200",
  "_score" : 2.3025851,
  "_source" : {
    "partitionId" : 2,
    "value" : {
      "timeToLive" : 60000,
      "variables" : { },
      "correlationKey" : "6DBC1F29EA6443FB8CC5EF641B7467DD",
      "messageId" : "",
      "name" : "CMTK_Validation_Message"
    },
    "sourceRecordPosition" : 16561397118456,
    "position" : 16561397119200,
    "timestamp" : 1578906516626,
    "valueType" : "MESSAGE",
    "intent" : "DELETED",
    "recordType" : "EVENT",
    "rejectionType" : "NULL_VAL",
    "rejectionReason" : "",
    "key" : 4503599650233676
  }
}

@Sachin, thank you for the insights.

This kind of delay is not expected.

Any idea how to reproduce this issue?
Do you see any other errors in the log file?
Do you observe any other unusual effect?

What is the workload?
How many partitions and brokers are in the cluster?