I’m having trouble canceling and deleting WorkflowInstances in Operate.
For example, I have an “Echo” WorkflowInstance with ID 225. When I try to cancel it, Operate gets a timeout and shows “Canceling Instance 225 failed”.
In the Zeebe Debug Log:
12:36:53.890 [io.zeebe.gateway.impl.broker.BrokerRequestManager] [gateway-zb-actors-0] ERROR io.zeebe.gateway - Error handling gRPC request
io.grpc.StatusRuntimeException: NOT_FOUND: Command rejected with code ‘CANCEL’: Expected to cancel a workflow instance with key ‘225’, but no such workflow was found
at io.grpc.Status.asRuntimeException(Status.java:523) ~[grpc-core-1.19.0.jar:1.19.0]
at io.zeebe.gateway.EndpointManager.convertThrowable(EndpointManager.java:257) ~[zeebe-gateway-0.17.0.jar:0.17.0]
at io.zeebe.gateway.EndpointManager.lambda$sendRequest$2(EndpointManager.java:235) ~[zeebe-gateway-0.17.0.jar:0.17.0]
at io.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequest$1(BrokerRequestManager.java:90) ~[zeebe-gateway-0.17.0.jar:0.17.0]
at io.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequest$3(BrokerRequestManager.java:109) ~[zeebe-gateway-0.17.0.jar:0.17.0]
at io.zeebe.gateway.impl.broker.BrokerRequestManager.lambda$sendRequestInternal$6(BrokerRequestManager.java:191) ~[zeebe-gateway-0.17.0.jar:0.17.0]
at io.zeebe.util.sched.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:35) [zeebe-util-0.17.0.jar:0.17.0]
at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:90) [zeebe-util-0.17.0.jar:0.17.0]
at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:53) [zeebe-util-0.17.0.jar:0.17.0]
at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:189) [zeebe-util-0.17.0.jar:0.17.0]
at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:154) [zeebe-util-0.17.0.jar:0.17.0]
at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:135) [zeebe-util-0.17.0.jar:0.17.0]
at io.zeebe.util.sched.ActorThread.run(ActorThread.java:112) [zeebe-util-0.17.0.jar:0.17.0]
Caused by: io.zeebe.gateway.cmd.BrokerRejectionException: Command (CANCEL) rejected (NOT_FOUND): Expected to cancel a workflow instance with key ‘225’, but no such workflow was found
I think the Elasticsearch and Zeebe data are out of sync, but we should be able to “force delete” a WorkflowInstance from Elasticsearch.
Hi @gizmo84, thanks for the report, and we’ll look into this. Just to clarify–you can see workflow instance ID 225 in Operate, but it seems that Zeebe is in some way out of sync?
A possibility that I’ll throw out there: could it be that you submitted the cancellation request multiple times because of a lag in the UI, and this NOT_FOUND error was the result of one of the additional cancellation requests? Or did the workflow instance never cancel? Let me know if that makes sense.
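To illustrate the duplicate-request scenario, here is a minimal sketch of a toy in-memory model. It is not Zeebe’s implementation, and `ToyBroker`, `start`, and `cancel` are hypothetical names made up for this example. It only shows why a second cancel for the same key would be rejected once the first one has removed the instance.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: NOT Zeebe's implementation. All names are hypothetical.
class DuplicateCancelSketch {

    /** Stand-in for the broker's set of active workflow instance keys. */
    static class ToyBroker {
        private final Set<Long> activeInstances = new HashSet<>();

        void start(long key) {
            activeInstances.add(key);
        }

        /** Returns "OK" if the instance was active, otherwise a NOT_FOUND rejection. */
        String cancel(long key) {
            if (activeInstances.remove(key)) {
                return "OK";
            }
            return "NOT_FOUND";
        }
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        broker.start(225L);
        System.out.println(broker.cancel(225L)); // first request cancels the instance
        System.out.println(broker.cancel(225L)); // repeated request finds nothing to cancel
    }
}
```

If this is what happened, the instance should show up as canceled in Operate once the view catches up. If it stays “running” indefinitely, something else is going on.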
Hi @gizmo84, thanks for the screenshot. That helps. To recap:
Operate is still showing the instance as “running” (a green circle to the left of the workflow name as in your screenshot)
But you can’t cancel the instance in Operate even though Operate says it’s running
And when you try to cancel the instance in Operate, you see the error in the Zeebe logs that you included in your first post
Did I get all of that right? If so, then I’ll take this to the Zeebe and Operate teams because it sounds like something unexpected is happening.
And to give some quick background on expected behavior:
Workflow instances currently cannot be deleted, only canceled, and after cancellation, they’ll be visible in Operate if “Canceled” is selected in the Filters menu (screenshot)
A canceled workflow instance will eventually be cleaned up from Zeebe state but will still be available in Operate
I created an instance of my test workflow and ran a few workers for the tasks in the workflow.
The workflow had not finished yet when I clicked the cancel button on the Operate page.
The broker log said “io.grpc.StatusRuntimeException: NOT_FOUND: Command rejected with code ‘CANCEL’: Expected to cancel a workflow instance with key ‘4503599627374654’, but no such workflow was found”.
The instance was left hanging there; it could not be deleted or removed from the page.
Hi @i.m.superman, can you please post a minimal reproducer: that’s the minimum number of exact steps to reproduce the problem (including a Git Repo with the worker code). To be able to debug this for you, we’d need to see it happening, and see exactly what you are doing that has it happen.
OK, do you see the job being output to the console in your worker when you start it?
One thing to note is that the representation in Operate lags behind the state of the system, because it comes from the exporter. So it does not update in real-time.
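A rough way to picture that lag is the toy model below. It is only a sketch and not how Zeebe, the exporter, or Operate are actually implemented; `ToyBroker`, `ToyView`, and `applyNext` are hypothetical names. The broker state changes immediately, while the view only changes when an exported record is applied.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model only: NOT how Zeebe or Operate are implemented. All names are hypothetical.
class ExporterLagSketch {

    /** Source of truth: applies commands immediately and queues records for export. */
    static class ToyBroker {
        final Map<Long, String> state = new HashMap<>();
        final Queue<String[]> pendingExport = new ArrayDeque<>();

        void start(long key) { update(key, "ACTIVE"); }

        void cancel(long key) { update(key, "CANCELED"); }

        private void update(long key, String status) {
            state.put(key, status);
            pendingExport.add(new String[] {Long.toString(key), status});
        }
    }

    /** Read-only view that changes only when an exported record is applied. */
    static class ToyView {
        final Map<Long, String> shown = new HashMap<>();

        void applyNext(ToyBroker broker) {
            String[] record = broker.pendingExport.poll();
            if (record != null) {
                shown.put(Long.parseLong(record[0]), record[1]);
            }
        }
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        ToyView view = new ToyView();

        broker.start(225L);
        view.applyNext(broker);   // view now shows the instance as ACTIVE
        broker.cancel(225L);      // broker state changes right away

        // The view still shows the old status until the next record is applied.
        System.out.println(broker.state.get(225L) + " vs " + view.shown.get(225L));

        view.applyNext(broker);   // view catches up
        System.out.println(broker.state.get(225L) + " vs " + view.shown.get(225L));
    }
}
```

In this model, a cancel sent while the view is stale would look like it targets a running instance even though the broker has already finished with it.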