Hello @Zelldon, thank you for your response.
I understand now that connection problems between the broker and workers don’t fail the workflow execution, thank you for the explanation.
However, I’m still trying to figure out the best approach, from an operational standpoint, for dealing with transient errors, even when they occur somewhere other than the broker-worker connection.
For example, take a scenario in which a job worker communicates with a database that is under heavy load for a few minutes, causing the worker to time out repeatedly and throw an exception. From what I understand, Zeebe would still exhaust the number of retries for that job, which would lead to an incident and halt execution of the workflow instance.
Once the database becomes available again, the jobs would be eligible for execution; however, resuming the previously halted instance would require manual intervention, despite the error being transient.
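To make the scenario concrete, here is roughly how I picture the worker behaving, as a sketch with the Zeebe Java client (the job type, gateway address, and the database call are placeholders I made up for illustration):

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;
import java.util.Map;

public class DbWorker {

  public static void main(String[] args) throws InterruptedException {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder address
        .usePlaintext()
        .build()) {
      client.newWorker()
          .jobType("write-to-db") // hypothetical job type
          .handler(DbWorker::handle)
          .open();
      Thread.currentThread().join(); // keep the worker running
    }
  }

  static void handle(JobClient jobClient, ActivatedJob job) {
    try {
      writeToDatabase(job.getVariablesAsMap()); // placeholder for the real DB call
      jobClient.newCompleteCommand(job.getKey()).send().join();
    } catch (Exception e) {
      // Each failure decrements the retry count; once it reaches zero,
      // Zeebe raises an incident and the instance halts there.
      jobClient.newFailCommand(job.getKey())
          .retries(job.getRetries() - 1)
          .errorMessage(e.getMessage())
          .send()
          .join();
    }
  }

  static void writeToDatabase(Map<String, Object> variables) {
    // May time out repeatedly while the database is under heavy load.
  }
}
```

My worry is about what happens after that last `newFailCommand` runs out of retries while the outage is still ongoing.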
I’m wondering whether my understanding is correct, and whether you have any insights into how one could automate an operational procedure for this sort of situation. If many jobs failed, it would also be impractical to resume workflow execution through a GUI such as Operate.
I’m thinking a possible approach would be to create a background task that continually monitors the system for instances in an “incident” state (for example, using zeeqs) and periodically reactivates each one.
However, from what I understand, zeeqs imports data from the Hazelcast exporter rather than a persistent store, so data could be lost if the ring buffer fills up before it is consumed. That doesn’t seem ideal for a production scenario either.
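In case it clarifies what I mean, here is a rough sketch of the background task I have in mind, using the Zeebe Java client. The incident lookup (`fetchOpenIncidents`) is entirely hypothetical, since as far as I know the client itself cannot list incidents, so that part would have to query whatever store the exporter feeds (zeeqs, Operate, etc.):

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.List;

public class IncidentResolver {

  static final class Incident {
    final long incidentKey;
    final long jobKey;
    Incident(long incidentKey, long jobKey) {
      this.incidentKey = incidentKey;
      this.jobKey = jobKey;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // placeholder address
        .usePlaintext()
        .build()) {
      while (true) {
        for (Incident incident : fetchOpenIncidents()) {
          // Give the failed job fresh retries, then resolve the incident
          // so the workflow instance can continue.
          client.newUpdateRetriesCommand(incident.jobKey).retries(3).send().join();
          client.newResolveIncidentCommand(incident.incidentKey).send().join();
        }
        Thread.sleep(60_000); // poll interval, arbitrary
      }
    }
  }

  // Hypothetical: would query the exporter-backed store for open incidents.
  static List<Incident> fetchOpenIncidents() {
    return List.of();
  }
}
```

Is something along these lines a reasonable pattern, or is there a recommended way to do this?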
I would appreciate any tips you may have regarding these scenarios I’ve described.