Hello @Zelldon, thank you for your response.
I understand now that connection problems between the broker and workers don’t fail the workflow execution, thank you for the explanation.
However, I’m still trying to figure out the best approach, from an operational standpoint, for dealing with transient errors, even when they occur somewhere other than the broker-worker connection.
For example, take a scenario in which a job worker communicates with a database that is under heavy load for a few minutes, causing the worker to time out repeatedly and throw an exception. From what I understand, Zeebe would still exhaust the number of retries for that job, which would lead to an incident and halt execution of the workflow instance.
Once the database becomes available again, the jobs would be eligible for execution; however, resuming the previously halted instance would require manual intervention, despite the error being transient.
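To make the scenario concrete, here is roughly how I picture the worker behaving, as a sketch with the Zeebe Java client (the job type, gateway address, and the database call are placeholders I made up for illustration):

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;
import java.util.Map;

public class DbWorker {

  public static void main(String[] args) throws InterruptedException {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder address
        .usePlaintext()
        .build()) {
      client.newWorker()
          .jobType("write-to-db") // hypothetical job type
          .handler(DbWorker::handle)
          .open();
      Thread.currentThread().join(); // keep the worker running
    }
  }

  static void handle(JobClient jobClient, ActivatedJob job) {
    try {
      writeToDatabase(job.getVariablesAsMap()); // placeholder for the real DB call
      jobClient.newCompleteCommand(job.getKey()).send().join();
    } catch (Exception e) {
      // Each failure decrements the retry count; once it reaches zero,
      // Zeebe raises an incident and the instance halts there.
      jobClient.newFailCommand(job.getKey())
          .retries(job.getRetries() - 1)
          .errorMessage(e.getMessage())
          .send()
          .join();
    }
  }

  static void writeToDatabase(Map<String, Object> variables) {
    // May time out repeatedly while the database is under heavy load.
  }
}
```

My worry is about what happens after that last `newFailCommand` runs out of retries while the outage is still ongoing.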
I’m wondering whether my understanding is correct, and whether you have any insights into how one could automate an operational procedure for this sort of situation. If many jobs failed, it would also be impractical to resume workflow execution through a GUI such as Operate.
I’m thinking a possible approach would be to create a background task that continually monitors the system for instances in an “incident” state (for example, using zeeqs) and periodically reactivates each one.
However, from what I understand, zeeqs imports data from the Hazelcast exporter rather than a persistent store, so data could be lost if the ring buffer fills up before it is consumed. That doesn’t seem ideal for a production scenario either.
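In case it clarifies what I mean, here is a rough sketch of the background task I have in mind, using the Zeebe Java client. The incident lookup (`fetchOpenIncidents`) is entirely hypothetical, since as far as I know the client itself cannot list incidents, so that part would have to query whatever store the exporter feeds (zeeqs, Operate, etc.):

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.List;

public class IncidentResolver {

  static final class Incident {
    final long incidentKey;
    final long jobKey;
    Incident(long incidentKey, long jobKey) {
      this.incidentKey = incidentKey;
      this.jobKey = jobKey;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder address
        .usePlaintext()
        .build()) {
      while (true) {
        for (Incident incident : fetchOpenIncidents()) {
          // Give the failed job fresh retries, then resolve the incident
          // so the workflow instance can continue.
          client.newUpdateRetriesCommand(incident.jobKey).retries(3).send().join();
          client.newResolveIncidentCommand(incident.incidentKey).send().join();
        }
        Thread.sleep(60_000); // poll interval, arbitrary
      }
    }
  }

  // Hypothetical: would query the exporter-backed store for open incidents.
  static List<Incident> fetchOpenIncidents() {
    return List.of();
  }
}
```

Is something along these lines a reasonable pattern, or is there a recommended way to do this?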
I would appreciate any tips you may have regarding these scenarios I’ve described.