RESOURCE_EXHAUSTED on Zeebe 0.24

Frank Zhang: Has anyone else had the same issue: RESOURCE_EXHAUSTED?

io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Expected to activate jobs of type 'gate-update-instance-failure-reason', but no jobs available and at least one broker returned 'RESOURCE_EXHAUSTED'. Please try again later.

aviv.elmakias.b: I’m having the same issue.
What version of zeebe are you using?

Frank Zhang: @aviv.elmakias.b compile group: 'io.zeebe.spring', name: 'spring-zeebe-starter', version: '0.24.0'

Frank Zhang: If you resolve the issue, please notify me, and I will do the same.

aviv.elmakias.b: I’m using zeebe 0.25.1 with zeebe-node@0.23.2 and found out it has something to do with the longPoll timeout (it may have a slightly different name in Spring).
I tried lowering it from 30 seconds to 15, and now I get this exception far less often.

aviv.elmakias.b: After running a worker in debug mode, I found that this problem happens only when the gRPC stream closes before the agreed timeout and the worker then tries to reconnect.
I guess it’s thrown when the worker tries to reconnect but the server has already closed the stream.

Frank Zhang: Many thanks. I’m not sure how to configure this setting. I can’t find any timeout config for long polling. I’m using Zeebe 0.24.0, and LongPollingCfg only has an option to enable it.

aviv.elmakias.b: This is something that should be configured in the worker, not the engine.

Frank Zhang: Thanks. Can I set the config like this: @ZeebeWorker(type = "xxx-convert", timeout = 15)?

Frank Zhang: Resource exhausted on create workflow instance if no topology available (#5789, https://github.com/zeebe-io/zeebe/issues/5789)

aviv.elmakias.b: I don’t think it’s the same issue we have here, but maybe they’re related somehow.
I get RESOURCE_EXHAUSTED without activating any workflow, just after starting the workers and connecting them to the engine.

Josh Wulf: Is there any workload in the engine (running process instances)? And how many workers are you connecting?

Note: This post was generated by Slack Archivist from a conversation in the Zeebe Slack, a source of valuable discussions on Zeebe (get an invite). Someone in the Slack thought this was worth sharing!

aviv.elmakias.b: @Josh Wulf I’m running about 5 different workers, each handling about 5-10 tasks, with around 30 workflows deployed to Zeebe (which runs as a StatefulSet with 3 replicas plus 1 standalone Zeebe gateway, although the problem also occurs when running Zeebe as a single Docker container).
I’m getting this error right after deploying an environment like this one, without actually starting any workflow.

Josh Wulf: So you have 25-50 polling requests going on?

AFAIK, each activate jobs request goes into the stream. If the stream processor lags (the request stream grows faster than the stream processor processes it), the Gateway sends gRPC status code 8 to incoming requests that are not job complete commands.
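
Note (illustrative sketch, not from the thread): since the gateway answers with gRPC status code 8 under backpressure, and the error message itself says to try again later, one common client-side reaction is to retry with exponential backoff rather than re-sending immediately. The Java sketch below shows that idea under stated assumptions: `ResourceExhaustedException` is a hypothetical stand-in for a gRPC exception whose status is RESOURCE_EXHAUSTED, and `BackoffSketch`, `delayMillis`, and `callWithBackoff` are made-up names, not part of the Zeebe client.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a gRPC exception carrying RESOURCE_EXHAUSTED (status code 8). */
final class ResourceExhaustedException extends RuntimeException {
    ResourceExhaustedException(String message) {
        super(message);
    }
}

/** Assumption: a generic retry helper for illustration, not the Zeebe client's own retry logic. */
final class BackoffSketch {

    /** Delay before retry number {@code attempt} (0-based): base * 2^attempt, capped at maxMillis. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.max(0, Math.min(attempt, 20)); // bound the shift so the delay cannot overflow
        return Math.min(baseMillis << shift, maxMillis);
    }

    /** Runs {@code call}, retrying only on ResourceExhaustedException, for at most maxAttempts tries. */
    static <T> T callWithBackoff(Supplier<T> call, int maxAttempts, long baseMillis, long maxMillis) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (ResourceExhaustedException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e; // give up and surface the last backpressure error
                }
                sleep(delayMillis(attempt, baseMillis, maxMillis));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }
}
```

With a base of 100 ms and a cap of 5000 ms, the delays grow as 100, 200, 400, 800 ms and then level off at 5000 ms, which gives a lagging stream processor time to catch up. Whether this helps in the situation discussed here is not confirmed by the thread.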

aviv.elmakias.b: It could be it.
Although the tasks actually work when activating a workflow, so I guess it throws the exception but starts a new stream right after?
So adding more CPU to the gateway should fix this? Or are there other properties that should be updated as well?

Josh Wulf: There is also this: https://github.com/zeebe-io/zeebe/issues/3367

aviv.elmakias.b: @Josh Wulf I think that’s a different case of ‘RESOURCE_EXHAUSTED’, not what’s happening here.
Did you try to reproduce it?

aviv.elmakias.b: I’m still having the same issue with 0.26.0. I didn’t run anything; I get the same error right after deploying the new environment.

Josh Wulf: If you give me an exact set of steps to reproduce - or even better, a minimal reproducer repository - then I can have a crack at reproducing and diagnosing it

aviv.elmakias.b: I’ll try to reproduce it in the near future and open an issue about it

Frank Zhang: @aviv.elmakias.b The issue still exists after upgrading Zeebe to 0.26.0.

Frank Zhang: Some Zeebe cluster logs say “Failed to complete BrokerActivateJobsRequest” because of “call already closed”.

Frank Zhang:

2021-02-05 00:03:33.429 [GatewayLongPollingJobHandler] [gateway-scheduler-zb-actors-0] WARN  io.zeebe.gateway - Failed to complete BrokerActivateJobsRequest{requestDto={"type":"gate-email-convert","worker":"worker-name","timeout":300000,"maxJobsToActivate":32,"jobKeys":[],"jobs":[],"variables":[],"truncated":false}}
java.lang.IllegalStateException: call already closed
        at com.google.common.base.Preconditions.checkState(Preconditions.java:508) ~[guava-30.0-jre.jar:?]
        at io.grpc.internal.ServerCallImpl.closeInternal(ServerCallImpl.java:209) ~[grpc-core-1.34.0.jar:1.34.0]
        at io.grpc.internal.ServerCallImpl.close(ServerCallImpl.java:202) ~[grpc-core-1.34.0.jar:1.34.0]
        at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onCompleted(ServerCalls.java:380) ~[grpc-stub-1.34.0.jar:1.34.0]
        at io.zeebe.gateway.grpc.ErrorMappingStreamObserver.onCompleted(ErrorMappingStreamObserver.java:110) ~[zeebe-gateway-0.26.0.jar:0.26.0]

Frank Zhang:

2021-02-05 00:03:33.433 [GatewayLongPollingJobHandler] [gateway-scheduler-zb-actors-0] ERROR io.zeebe.util.actor - Uncaught exception in 'GatewayLongPollingJobHandler' in phase 'STARTED'. Continuing with next job.
java.lang.IllegalStateException: call already closed

Josh Wulf: That sounds like the client is closing the request, and making another one. It would be useful to see how many JobActivationRequests are being sent. Are you able to get that via the TRACE level logging?

Frank Zhang: We are only testing in a dev environment. There are only several workflow instances, which seems to be fewer than maxJobsToActivate:32.

Josh Wulf: If the client is making a call to activate jobs, then terminating the call before the server responds and making another one, you would see that error in the gateway logs, and the broker could also be flooded by requests, leading to RESOURCE_EXHAUSTED (gRPC status 8).
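
Note (illustrative arithmetic, not from the thread): a rough way to see why calls that end early could flood the broker is to estimate the request rate as the number of pollers divided by how long each call stays open. The model and the names `PollLoadSketch` and `requestsPerSecond` are assumptions for illustration; it assumes every poller re-issues a call as soon as the previous one ends and ignores anything a real client does between calls.

```java
/** Assumption: a back-of-the-envelope load model, not a measurement of any real Zeebe cluster. */
final class PollLoadSketch {

    /**
     * Approximate activate-jobs requests per second when each poller opens a new call
     * immediately after the previous one ends.
     */
    static double requestsPerSecond(int pollers, long callLifetimeMillis) {
        if (pollers < 0 || callLifetimeMillis <= 0) {
            throw new IllegalArgumentException("pollers must be >= 0 and callLifetimeMillis > 0");
        }
        return pollers * 1000.0 / callLifetimeMillis;
    }
}
```

Under this model, 50 pollers whose calls stay open for 30 seconds produce roughly 1.7 requests per second, while the same 50 pollers with calls that end after 100 ms produce 500 requests per second.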

Josh Wulf: I also found this: https://github.com/zeebe-io/zeebe/issues/6055#issuecomment-755192933