Is gRPC a point of failure when embedding the workflow in a program?

Hi.

A little context for my question: I got here after reading several blog posts from Bernd Rücker where he discusses the workflow capabilities of Zeebe as a means to deal with the inherent problems of the IP network. Basically, he proposes embedding the Zeebe workflow in the app to deal with retries, timeouts and transactions, for instance.

So far, so good, but isn't that at odds with the fact that the workflow is not really embedded in the application? Every command/response used in the client application is sent over the network using gRPC, and can hit the very problems I am trying to avoid.

Thanks in advance
JCS

Hi @jcavieres, without an actual quote from Bernd in context, I’m going to channel him here…

I think you have mashed up a few different ideas here. If you want to do away with network problems, then you have to do away with the network and put everything in-memory on one machine.

Zeebe does not do away with the issues of the network. It does away with many of the issues of dealing with the network, by making them a concern of the broker, rather than the shared responsibility of every single service in the system.

One of the issues with a peer-to-peer, choreographed REST microservices architecture is that it is the responsibility of each service to deal with the failure of the other parts it is coupled with. Thus you have things like the circuit breaker pattern - which you have to implement at every service boundary where coupling occurs.

You can go some way to mitigating this with asynchronous messaging over something like RabbitMQ, where the broker can buffer messages while another system recovers.

Now you move the concern of dealing with failure out of your service, and your service code is reduced to business logic.

But you still have no way to know what the system does, without up-to-date documentation or reading the source code and correlating it with the other services / contract specifications (hoping they are up to date).

So Zeebe is a solution to these two problems. You move the coordination, with all its meta-work, out of the services, and make them stateless with a simple worker interface of pulling tasks and completing tasks. And you have executable (guaranteed up-to-date) documentation of your system’s operation in the BPMN diagram.

You are going to have to deal with complexity, resilience to failure, and scalability somewhere in your system. The premise of Zeebe is that some of these are best dealt with in one place, rather than everywhere. Acknowledge that your business process is in fact stateful, and have a state management component - Zeebe; while reaping the scalability benefits of stateless workers.

Yes, if the connection between the worker and the broker goes down, no work happens. So, Zeebe allows multiple distributed stateless workers to mitigate against this.

What if the broker goes away? Well, it’s clustered and replicates data, so you can survive the disappearance of one or two nodes (depending on your node count).

What if the data centre disappears in a meteorite strike?

OK, then you might have a problem - with any architecture.

hope this helps,
Josh

1 Like

Thanks for the quick response, but I still don’t understand.
Let me give you an example.
Imagine I am creating my microservice in Golang and want to use Zeebe to deal with network problems statefully, so I add the following lines:

const (
	zeebeBrokerAddr = "192.168.11.10:51015"
)

// Open a gRPC connection to the Zeebe gateway.
newClient, err := zbc.NewClient(zeebeBrokerAddr)

What is happening under the hood is an HTTP/2 connection (gRPC) from my client to the Zeebe broker:
https://docs.zeebe.io/basics/client-server.html

But the network is not perfect, and any of these 3 errors may happen (2 of them can't be fixed by the gateway/broker, because the issue is in the network).

[Screenshot: diagram of the three possible failure points in the client-broker request/response exchange]

As far as I know, gRPC currently doesn't handle network-related problems well:
https://github.com/grpc/proposal/blob/master/A6-client-retries.md

So, if there's a network problem, gRPC may fail, my workflow never gets executed, and as a consequence my entire microservice fails for the same reasons that led me to consider adding Zeebe in the first place.

What am I missing?

1 Like

In the JavaScript library, if the call to createWorkflowInstance fails at the gRPC transport layer, it throws an exception.

You just catch that exception and implement retry logic. Idempotent start messages can be used to make sure you don’t start the same workflow instance twice.
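
For example, something like this - a rough sketch, not the library's recommended pattern; the process id, variables and retry numbers are just placeholders:

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('192.168.11.10:51015')

// Retry the start call a few times before giving up.
async function startWithRetry(retries = 5, delayMs = 1000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      // Rejects with a gRPC transport error (e.g. 14 UNAVAILABLE)
      // if the gateway cannot be reached.
      return await zbc.createWorkflowInstance('order-process', { orderId: 123 })
    } catch (err) {
      if (attempt === retries) throw err
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
}
```

Note that if a failed attempt did in fact reach the broker before the connection dropped, a blind retry could start the instance twice - that's where the idempotent start message comes in.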

I think the impedance mismatch is that Zeebe is not a solution to network disruption. Part of its philosophy is reducing the complexity of the code in your solution that has to deal with network disruption.

You have to design for network failure somehow. Zeebe does not eliminate network failure. If the network fails, the network fails, whether you are using REST over HTTP or gRPC over HTTP. How your system handles that, works around that, and recovers from it is part of your solution design. A Zeebe-based solution has a particular approach to this. As well, Zeebe addresses a number of other issues in a microservices architecture.

You need to consider the entire architecture to determine if it solves problems that you have.

Zeebe will not magically make a broken connection between a worker and a broker work.

If your microservice is starting a workflow, then it will need to retry on a network failure, because the state of your business process is in the microservice.

Unless that microservice was triggered from a workflow running in Zeebe. In that case, if it does not report back to the broker, the broker will time it out and pass its work to another worker, which will then start the workflow.

If it is a microservice that starts a workflow in response to a REST request from some outside system, then it needs to implement retry if the call to the broker gateway fails - just like it would for any other transport or system that it communicates with over the network.

Once the workflow is started, microservices that participate in the workflow communicate with the broker, and the broker handles timing them out and reassigning their work to another worker if their network connection fails.

If the broker is unreachable by services, then all work will stop - so your system does have a single point of failure. The benefit of this is that you can concentrate your fail-over strategy on a single component. The disadvantage is that you have one “all-or-nothing” failure point.

Thanks again.
I believe that Zeebe would be a much better solution if the Zeebe developers, and not its users, handled the retries/delays needed to make the clients' connection to the broker via gRPC more tolerant of network faults.

As can be read in
https://github.com/grpc/proposal/blob/master/A6-client-retries.md#exponential-backoff

Others have already done a lot to improve the default behavior of gRPC, and those changes could be added to the default Zeebe clients.
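
For reference, the retry policy in that proposal is configured declaratively through the gRPC service config. Roughly, it looks like this (the Zeebe gateway service name below is my guess and would need to be checked against the gateway .proto):

```typescript
// Retry policy in the shape defined by the gRPC client-retries proposal (A6).
const serviceConfig = {
  methodConfig: [
    {
      // Assumed service name for the Zeebe gateway - verify against the .proto.
      name: [{ service: 'gateway_protocol.Gateway' }],
      retryPolicy: {
        maxAttempts: 5,
        initialBackoff: '0.1s',
        maxBackoff: '10s',
        backoffMultiplier: 2,
        retryableStatusCodes: ['UNAVAILABLE'],
      },
    },
  ],
}
```

Clients that implement the proposal pick this up from the service config (delivered via DNS, or set as a channel option), so no retry loops end up in application code.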

Regards

Hmmmm… I see what you are saying.

At the moment in the JavaScript library - which I maintain - all retry logic is the responsibility of the application code.

I can see that DRY-ing configurable retry logic into the client lib layer makes more sense than reimplementing it for every single use in the application.

Right now, for both ZBClient.createWorkflowInstance() and the worker task handler callbacks complete.success()/complete.failure(), the lib throws Error: 14 UNAVAILABLE: failed to connect to all addresses when the broker is uncontactable.

With the worker callback: we know how long the worker has to complete the task, because it is configured with a timeout when it is created in the application code. It communicates that to the broker when it activates jobs: “You should time me out if I don’t complete this task in n milliseconds”.

So we could retry the success()/failure() call repeatedly until we know the broker will have timed the task out, then throw in the client. We know the broker will retry the task with another worker, and the application code can do any clean-up or error reporting it wants to in an exception handler.
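
Roughly, the logic inside the lib would look something like this (the names are illustrative, not the actual zeebe-node internals):

```typescript
// Keep retrying the complete call until the job's own timeout has elapsed
// (at which point the broker will reassign it anyway), then give up and
// surface the error to the application.
async function completeWithRetry(
  sendComplete: () => Promise<void>, // wraps the gRPC CompleteJob call
  jobTimeoutMs: number,              // the timeout the worker was configured with
  retryDelayMs = 500
): Promise<void> {
  const deadline = Date.now() + jobTimeoutMs
  while (true) {
    try {
      return await sendComplete()
    } catch (err) {
      if (Date.now() + retryDelayMs >= deadline) {
        // The broker has (or will shortly have) timed the job out and handed
        // it to another worker - let the application clean up.
        throw err
      }
      await new Promise(resolve => setTimeout(resolve, retryDelayMs))
    }
  }
}
```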

For the createWorkflowInstance call, the desired behaviour is less clear. Maybe give it some options: retries and retryDelayMillis? And do the retries in the client lib and throw on the final failure?
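
As a sketch, the option shape might be something like this - purely hypothetical, not an existing zeebe-node API:

```typescript
// Sketching the options discussed above; nothing like this exists in the lib today.
interface CreateInstanceRetryOptions {
  retries: number          // attempts before throwing the last error
  retryDelayMillis: number // wait between attempts
}

// Intended usage, if it were implemented:
//   zbc.createWorkflowInstance('order-process', vars, { retries: 5, retryDelayMillis: 500 })
```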

I did a little work on what it would look like to implement this in the client lib.

It complicates the internal structure significantly.

I wonder about the specific failures that it is mitigating against - I don’t want to prematurely optimize without a specific use-case.

Points of failure client-broker over gRPC

There are three points of contact:

  1. Client initiating an operation on the broker
  • failJob
  • publishMessage
  • createWorkflowInstance
  • resolveIncident
  2. Worker activating jobs
  • activateJobs
  3. Worker completing / failing a job
  • completeJob
  • failJob

Current Failure Modes

Looking at the current behaviour of each one:

  1. Client initiating an operation on the broker
  • If the client is started and the broker is not contactable, any operations will throw.
  • If the broker becomes available, operations succeed.
  • If the broker goes away, operations throw again.
  • No automatic retries.
  2. Worker activating jobs
  • If the worker is started and the broker is not contactable, an error is printed to the console.
  • If the broker becomes available, the worker activates jobs.
  • If the broker goes away, no error is thrown or printed.
  • If the broker comes back, jobs are activated.
  • In this case, the worker polling is an automatic retry.
  3. Worker completing jobs
  • If the broker goes away after the worker has taken a job, the worker throws Error: 14 UNAVAILABLE when it attempts to complete the job.
  • No automatic retry.

Ways the broker could go away / be unavailable:

  • Broker address misconfigured.
  • Transient network failure.
  • Broker under excessive load.

Broker Address Misconfigured
This is an unrecoverable hard failure. Retries will not fix this.

Transient network failure
Some temporary disruption in connectivity between worker and broker. This could include a broker restarting or (potentially) a DNS change (this needs testing).
A retry will deal with this case if the transient network failure is fixed before the retries time out.
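For a sense of scale: five attempts with exponential backoff starting at 100 ms (so waits of 100, 200, 400 and 800 ms between attempts) only mask an outage of roughly 1.5 seconds; anything longer still surfaces as a failure, whatever the client does.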

Broker is under excessive load and cannot respond
In this case, retries may actually make it worse. Zeebe is horizontally scalable, but I have driven a single node to failure by pumping in a massive number of workloads while it is memory-starved (I can kill it with 2GB of memory, but haven't yet with 4GB), or by letting it run out of disk space (a slow exporter with high throughput can do this). Automated retries will not recover any of these situations.

If the broker is experiencing excessive load because of a traffic spike, then automated retries may drive it to failure, whereas workers failing to complete tasks once and letting the broker reschedule them may allow the broker to recover.

Other failure modes not distinguished
The as-yet unknown unknowns. Any ideas?


Conclusions

I’m not yet sure that automatic retry is (a) necessary; (b) a good idea.

The transient network failure seems to be the only case. I'm not sure how much of an issue that is in an actual system, whether it warrants complicating the code, or whether it is worth the potential downside of hammering a broker that is under excessive load (retries will be ineffective if it is a hard failure, and could contribute to a hard failure if it is a spike).


I’m open to more data on this, but I don’t have a case for implementing retries yet.

Regarding whether preparing for the "transient network failure" is reason enough to work on this issue, I will refer you (again) to Bernd Rücker's blog, where he discusses "The 8 Fallacies of Distributed Computing":

All the entries:
https://blog.bernd-ruecker.com/3-common-pitfalls-in-microservice-integration-and-how-to-avoid-them-3f27a442cd07

https://blog.bernd-ruecker.com/fail-fast-is-not-enough-84645d6864d3

(I couldn't add more links because I'm a new user)

So if Zeebe doesn't include protection against network failures, it becomes the obligation of each developer using Zeebe to implement that protection, and I don't think that makes sense in this case.

1 Like

Maybe… We still don’t have a concrete use case to assess. Do you have one in mind?

All engineering is a cost/benefit analysis.

A lot of the surface area for failure is reduced in the Zeebe architecture, and a number of failure mode retries are handled by the broker.

The whole point is that by adopting the Zeebe architecture, a lot of that is taken care of. Yes, there is still the possibility of failure - but most of the scenarios involve failure modes that are not obviously best addressed by adding automated gRPC retry.

Yes, the network is unreliable. For example, the broker may go away. It’s not a foregone conclusion that the best solution to that is automated retry. In many cases of that level of failure, it will be a DevOps operation to bring it back (including, potentially, Google Cloud Platform engineers trying to bring their availability zone back) - and the issue that we are solving for is how the state of the business process is retained through the failure and reinstated when the component / connection is restored. And also what happens to pressure in the system when this happens.

When the state lives nowhere and everywhere, as it does in peer-to-peer choreography, then failure in a component threatens the business process in a particular way - you need to keep the state alive where it is, and retry is essential. So, in that architecture the benefit is clear, and the cost is clearly worth paying.

I did some more investigation of the engineering effort involved in adding retry to the JavaScript library. It looks like it would involve implementing clientInterceptor calls for the grpc-node library in the node-grpc-client library, which is downstream of node-grpc and upstream of zeebe-node.

The fact that they haven’t implemented it already is an indication (not proof) that it is not needed in practice sufficiently to have driven its development. I would be implementing it upstream without a use-case downstream driving it - which makes the specific implementation speculative in both its necessity and the form it should take.

I talked it over with one of the main users of zeebe-node, and at the moment, in their use-case, the rate of that failure mode and its impact aren't sufficient to warrant it. They persist state at boundaries and retry explicitly, partly to protect against memory pressure: if the broker fails, the last thing you want is a cascading failure as memory is exhausted - then you lose all in-memory state. Zeebe itself is designed to protect against this happening inside the broker boundary, with an append-only event log on disk plus replication across multiple nodes.

There are other areas of the system, and other recovery modes that are more appropriate in a Zeebe system.

That’s not to say that some other failure mode may emerge where retry makes sense - but at the moment it looks like you won’t be manually coding retries at that level. When a call fails, some component is in a hard failure mode, and there is no state for you to retain in the service.

One exception to this that I can see is on the boundary of the system where you trigger workflows from an external system. Long-running processes there may hold the state of the business process in memory and retry operations over a long period of time while they wait for the broker to come back.

Personally, I would put “state outside the engine that cannot be lost” in a database or queue and explicitly retry with business logic.
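
As a sketch of what I mean, with an in-memory array standing in for a real database table or queue (and the same createWorkflowInstance call discussed above):

```typescript
import { ZBClient } from 'zeebe-node'

interface PendingStart {
  id: string
  processId: string
  variables: Record<string, unknown>
}

const zbc = new ZBClient('192.168.11.10:51015')
const pendingStarts: PendingStart[] = [] // stand-in for durable storage

// 1. The request handler only records the intent - nothing is lost if the
//    broker happens to be unreachable right now.
function recordStart(processId: string, variables: Record<string, unknown>) {
  pendingStarts.push({ id: `${processId}-${Date.now()}`, processId, variables })
}

// 2. A background loop drains the store, retrying on the next pass if the
//    gateway is unavailable.
async function drainPendingStarts() {
  for (const pending of [...pendingStarts]) {
    try {
      await zbc.createWorkflowInstance(pending.processId, pending.variables)
      pendingStarts.splice(pendingStarts.indexOf(pending), 1) // done, remove it
    } catch {
      // leave it in the store; the next drain run will retry
    }
  }
}

setInterval(drainPendingStarts, 5000)
```

The request handler never blocks on the broker, and the drain loop simply picks the command up again on its next pass once the gateway is reachable.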

Again, I’m open to doing the engineering to put automated retry in the Node library, but it would need to be driven by an actual use-case, balanced against other ways to handle the failure mode.

So if you actually need it, I’m happy to write it.

And: thanks for generating this conversation! This is an aspect that I hadn’t looked into at this level of detail.

1 Like

I found another Node gRPC library that includes a simple linear retry feature: https://github.com/bojand/grpc-caller.

Rebasing the zeebe-node library on that (assuming everything else works) would probably be a lower-orbit solution than adding support to node-grpc-client. So the engineering cost may be less. Still need a concrete use-case to do the coding. DRY-ing out retry handling from application code, if it actually shows up, would be the driver to do it. Let me know if it is a problem in practice.

I’d say the same economics apply to the Java and Go clients - I don’t maintain those, but I know the maintainers are amenable to PRs.

Hi Josh, I am a little confused about what counts as a "use case" here, beyond handling network interruptions gracefully. If what you're looking for is how often network interruptions happen, the attached report can make the case.

http://www.bailis.org/papers/partitions-queue2014.pdf

Hi @jcavieres, a use-case is a concrete scenario - ideally with proof-of-concept code / a reproducible architecture that we can test against. People are writing proof-of-concept, (and even production) systems with Zeebe at the moment, and we’re prioritising engineering efforts on solving the actual issues that users are encountering.

Are you using Zeebe at the moment, and has this been a problem for you in a specific scenario?

Or, have you developed a gRPC-based system before and encountered this as an issue previously?

I think the reason this is not "baked in" to the gRPC libraries is that the organization that developed gRPC typically configures network behaviour outside the application, and that is also becoming best practice in HTTP applications. The reason is that the same application may be used in different scenarios, with different levels of criticality and latency sensitivity. For example, you might ship logs with an aggressive exponential backoff that quickly increases the time between retries but allows a high retry count. On the other hand, you might load an email or complete a search request with a short burst of retries before surfacing an error to the user. In some cases you might not want any retries at all, if your workload is not idempotent - the bottom arrow in your diagram would cause the message to be executed twice by the service provider.

Technologies that implement this for gRPC include middleware libraries, proxies such as Envoy, and service meshes such as Linkerd and Istio (which incorporates Envoy).
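
For example, when the client reaches the gateway through an Envoy sidecar, the retry policy can be requested per call with Envoy's router headers instead of being coded into the service. The header names come from the Envoy docs linked under "Further reading" below; how you attach the metadata depends on your client library, so treat this as a sketch:

```typescript
import { Metadata } from '@grpc/grpc-js'

// Ask the Envoy sidecar to retry this call for us.
const metadata = new Metadata()
metadata.add('x-envoy-retry-grpc-on', 'unavailable') // retry only on UNAVAILABLE
metadata.add('x-envoy-max-retries', '3')             // at most 3 proxy-level retries

// ...then pass `metadata` along with the outgoing gRPC call.
```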

My 2c is that implementing this in the Zeebe client (1) reimplements things that are already available in middleware and sidecars, and (2) might solve one use case but can't solve them all, because Zeebe is general purpose.

You can see an example of how this can reduce the impact of outages in a post on the Google Cloud Blog: small critical workloads continued to mostly function, while non-critical traffic was dropped.

Further reading:

https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/router_filter#x-envoy-retry-grpc-on

As always, I’m also very interested in your use case 🙂

2 Likes

I found a concrete use-case. A race condition in containers coming up can cause a Node worker container to exit if it attempts to deploy a workflow before the gateway is ready to accept requests, and the error is not handled by the application.

This is a particularly pernicious problem, because it is a race condition. I investigated how to handle that via client-side retry logic.

I’ve created a PR against the Node library to add configurable client-side retry for gRPC error code 14 (UNAVAILABLE - roughly the equivalent of HTTP 503 SERVICE UNAVAILABLE), which typically indicates a transient network failure.
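
Once that lands, opting in from application code should look roughly like this - the option names are illustrative and may not match what actually ships:

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('192.168.11.10:51015', {
  retry: true,           // retry commands that fail with gRPC code 14 (UNAVAILABLE)
  maxRetries: 50,        // give up and throw after this many attempts
  maxRetryTimeout: 5000, // upper bound on the delay between attempts (ms)
})
```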

1 Like

As an update to this, the Node client now has some pretty sophisticated client-side retry functionality: it buffers commands and messages when the connection fails, and retries them with back-off.

It also has callback hooks for onReady and onConnectionFailed. I’ve been using it against Camunda Cloud, which at the moment reschedules pods, causing intermittent connection failures. The system is resilient with the Node client retry functionality, and recovers with no intervention.
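
For reference, wiring those hooks up looks roughly like this (the constructor-option form and the gateway address are placeholders rather than a definitive API):

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('localhost:26500', {
  // called when a connection to the gateway is (re)established
  onReady: () => console.log('connected to the gateway'),
  // called when the connection drops; commands are buffered and retried
  onConnectionFailed: () => console.warn('gateway connection lost'),
})
```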