AWS Fargate - Unhealthy Partition

MarceloEmmerich · November 4, 2020, 6:19am

Hello, we have deployed the latest Zeebe Docker image to AWS Fargate, and it keeps restarting endlessly because of the following error, which appears about 2 minutes after the container restarts:

2020-11-04 06:07:48.463 [Broker-0-HealthCheckService] [Broker-0-zb-actors-1] ERROR io.zeebe.broker.system - Partition-1 failed, marking it as unhealthy

After that, the broker shuts down. This repeats indefinitely because Fargate restarts the container after each failure. The setup is, so far, just one standalone broker with the gateway enabled and the default configuration.

Any help will be highly appreciated.

Thanks

MarceloEmmerich · November 6, 2020, 9:08am

Anyone? Can you maybe give me a hint about what causes a partition to be marked as unhealthy?

jwulf · November 7, 2020, 12:56pm

I haven’t used Fargate, but here are some guesses:

Insufficient resources? Disk? Memory?
The partition log directory not mounted correctly?

It will be some kind of environmental condition in Fargate. You can verify this by deploying the exact same config locally to see if the problem presents outside the Fargate env.

MarceloEmmerich · November 7, 2020, 9:36pm

Thanks @jwulf. I have deployed the same config locally and everything works, so it is definitely a problem with Fargate. I can’t determine what exactly is causing it, though. How can I check whether the partition log directory is mounted correctly?
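One way to approach the mount question is to run a small check inside the container against the broker's data directory. The following is only a sketch: the function name `check_data_dir` is made up for illustration, and the path you pass in (for example `/usr/local/zeebe/data`, which is assumed here to be the image's default data directory) has to match your own setup.

```python
import os
import shutil
import tempfile


def check_data_dir(path):
    """Collect basic facts about a data directory.

    Hypothetical helper, not part of Zeebe. It returns a dict and does
    not raise for a missing or read-only directory.
    """
    report = {
        "path": path,
        "exists": os.path.isdir(path),
        "is_mount_point": False,
        "writable": False,
        "free_bytes": None,
    }
    if not report["exists"]:
        return report

    # True only if the path itself is where a filesystem is mounted.
    report["is_mount_point"] = os.path.ismount(path)

    # Free space on the filesystem that holds the directory.
    report["free_bytes"] = shutil.disk_usage(path).free

    # Actually create and delete a file, which is a stronger test
    # than inspecting permission bits.
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe_file:
            probe_file.write(b"ok")
            probe_file.flush()
        report["writable"] = True
    except OSError:
        report["writable"] = False

    return report
```

Calling `check_data_dir("/usr/local/zeebe/data")` and printing the result shows whether the directory exists, whether it is a separate mount, whether it can be written to, and how much space is left. If `is_mount_point` is false, the directory lives on the container's own filesystem rather than on an attached volume.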

MarceloEmmerich · November 11, 2020, 3:31am

I have made some “progress”: I attached EFS storage and mounted the data dir on it. Now the error has changed:

2020-11-11T04:27:53.753+01:00	2020-11-11 03:27:53.740 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] ERROR io.zeebe.logstreams - Actor Broker-0-StreamProcessor-1 failed in phase STARTED.
2020-11-11T04:27:53.753+01:00	io.zeebe.db.ZeebeDbException: Unexpected error occurred during RocksDB transaction.
2020-11-11T04:27:53.753+01:00	at io.zeebe.db.impl.rocksdb.transaction.DefaultDbContext.runInTransaction(DefaultDbContext.java:142) ~[zeebe-db-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.db.impl.rocksdb.transaction.ZeebeTransactionDb.ensureInOpenTransaction(ZeebeTransactionDb.java:173) ~[zeebe-db-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.db.impl.rocksdb.transaction.ZeebeTransactionDb.whileTrue(ZeebeTransactionDb.java:304) ~[zeebe-db-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.whileTrue(TransactionalColumnFamily.java:93) ~[zeebe-db-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.whileTrue(TransactionalColumnFamily.java:147) ~[zeebe-db-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.whileTrue(TransactionalColumnFamily.java:84) ~[zeebe-db-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.engine.state.message.MessageState.visitMessagesWithDeadlineBefore(MessageState.java:271) ~[zeebe-workflow-engine-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.engine.processing.message.MessageTimeToLiveChecker.run(MessageTimeToLiveChecker.java:32) ~[zeebe-workflow-engine-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:76) ~[zeebe-util-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:107) [zeebe-util-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:91) [zeebe-util-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:204) [zeebe-util-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	Caused by: org.rocksdb.RocksDBException: IOError(StaleFile)
2020-11-11T04:27:53.753+01:00	at org.rocksdb.Transaction.commit(Native Method) ~[rocksdbjni-6.13.3.jar:?]
2020-11-11T04:27:53.753+01:00	at org.rocksdb.Transaction.commit(Transaction.java:206) ~[rocksdbjni-6.13.3.jar:?]
2020-11-11T04:27:53.753+01:00	at io.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.commitInternal(ZeebeTransaction.java:117) ~[zeebe-db-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.db.impl.rocksdb.transaction.DefaultDbContext.runInNewTransaction(DefaultDbContext.java:164) ~[zeebe-db-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	at io.zeebe.db.impl.rocksdb.transaction.DefaultDbContext.runInTransaction(DefaultDbContext.java:135) ~[zeebe-db-0.25.1.jar:0.25.1]
2020-11-11T04:27:53.753+01:00	... 13 more

salaboy · December 8, 2020, 3:40pm

@MarceloEmmerich now it might be a permissions problem… it might be finding the right path but not be able to write to it.
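To test the permissions guess, it can help to compare who owns the mounted directory with the user the broker process runs as. A minimal sketch, assuming a Linux container; `describe_access` is a made-up name, and the path is whatever your data directory is:

```python
import os
import stat


def describe_access(path):
    """Compare a directory's owner and mode with the current process.

    Hypothetical helper for diagnosis only.
    """
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        # Human-readable mode string such as "drwxr-xr-x".
        "mode": stat.filemode(st.st_mode),
        "process_uid": os.getuid(),
        "process_gid": os.getgid(),
        # Creating files in a directory needs both write and execute.
        "can_write": os.access(path, os.W_OK | os.X_OK),
    }
```

If `owner_uid` differs from `process_uid` and `can_write` is false, the process can see the directory but cannot create files in it, which would match the symptom described here. Note that this would not by itself explain the `IOError(StaleFile)` in the trace above.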

MarceloEmmerich · December 8, 2020, 7:07pm

Thanks all, I eventually gave up and set up a cluster in EKS instead. Now everything works fine.

eknaprasath · January 2, 2021, 8:00am

Hi Marcelo,

Could you please share the steps you followed to fix this issue? I am stuck on the same problem and have no clue what to do. If you could share the steps you followed, that would be a great help.

eknaprasath · January 2, 2021, 8:01am

Can you please help me with the steps you followed?

MarceloEmmerich · January 5, 2021, 10:18pm

Hi, I could not solve the problem and gave up trying to run a cluster on Fargate. Instead, I set up a Kubernetes cluster in EKS and used the available Helm charts as a starting point to further customize my setup. This has been working perfectly so far.

ankit_joinwal · August 12, 2021, 4:45pm

Did you use an ALB with gRPC for the Zeebe gateway? If yes, how did you define the health check in the ALB? Does the Zeebe gateway expose a health check service?
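This question was not answered in the thread. As a starting point for experimenting, here is a small HTTP probe sketch. It assumes that the Zeebe broker serves readiness and health endpoints over HTTP on its monitoring port (`/ready` and `/health` on port 9600 by default). Whether the gateway in your version answers on the same paths, and whether an ALB target group configured for gRPC can use such an endpoint, is not confirmed here. The function name `probe` is made up for illustration.

```python
import urllib.error
import urllib.request


def probe(url, timeout=2.0):
    """Return True if an HTTP GET to url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a 4xx/5xx status
        # (urllib raises HTTPError, a URLError subclass, for those).
        return False
```

For example, `probe("http://localhost:9600/ready")` run inside the container would show whether the assumed endpoint responds at all before you point a load balancer health check at it.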
