AWS Fargate - Unhealthy Partition

Hello, we have deployed the latest Zeebe Docker image to AWS Fargate, and it keeps restarting endlessly. About two minutes after the container starts, it logs the following error:

    2020-11-04 06:07:48.463 [Broker-0-HealthCheckService] [Broker-0-zb-actors-1] ERROR - Partition-1 failed, marking it as unhealthy

After that the broker shuts down, and the cycle repeats indefinitely as Fargate restarts the task after each failure. The setup so far is a single standalone broker with the gateway enabled and default configuration.

Any help will be highly appreciated.


Anyone? Can you maybe give me a hint about what triggers a partition being marked as unhealthy?

I haven’t used Fargate, but here are some speculations:

  • Insufficient resources? Disk? Memory?
  • The partition log directory not mounted correctly?

It will be some kind of environmental condition in Fargate. You can verify this by deploying the exact same configuration locally to see if the problem presents outside the Fargate environment.
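For the local repro, something like this sketch should be close (the image tag and data path are assumptions based on the official Zeebe image; match them to whatever your Fargate task definition actually runs):

```shell
# Run a single standalone broker with the gateway, persisting data to a
# local volume -- mirrors a default single-broker setup.
# Image tag and container data path are assumptions; adjust as needed.
docker run --rm \
  -p 26500:26500 -p 9600:9600 \
  -v "$(pwd)/zeebe-data:/usr/local/zeebe/data" \
  camunda/zeebe:0.25.1
```

If the broker stays healthy here with the same config, the problem is in the Fargate environment (storage, resources, networking), not in Zeebe itself.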

Thanks @jwulf. I have deployed locally and everything works, so it is definitely a problem with Fargate. I can't determine what exactly is causing it, though. How can I tell whether the log directory partition is mounted correctly?

I have made some “progress”: I attached EFS storage and mounted the data dir to it. Now the error has changed:

    2020-11-11 03:27:53.740 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] ERROR io.zeebe.logstreams - Actor Broker-0-StreamProcessor-1 failed in phase STARTED.
    io.zeebe.db.ZeebeDbException: Unexpected error occurred during RocksDB transaction.
        at io.zeebe.db.impl.rocksdb.transaction.DefaultDbContext.runInTransaction( ~[zeebe-db-0.25.1.jar:0.25.1]
        at io.zeebe.db.impl.rocksdb.transaction.ZeebeTransactionDb.ensureInOpenTransaction( ~[zeebe-db-0.25.1.jar:0.25.1]
        at io.zeebe.db.impl.rocksdb.transaction.ZeebeTransactionDb.whileTrue( ~[zeebe-db-0.25.1.jar:0.25.1]
        at io.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.whileTrue( ~[zeebe-db-0.25.1.jar:0.25.1]
        at io.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.whileTrue( ~[zeebe-db-0.25.1.jar:0.25.1]
        at io.zeebe.db.impl.rocksdb.transaction.TransactionalColumnFamily.whileTrue( ~[zeebe-db-0.25.1.jar:0.25.1]
        at io.zeebe.engine.state.message.MessageState.visitMessagesWithDeadlineBefore( ~[zeebe-workflow-engine-0.25.1.jar:0.25.1]
        at ~[zeebe-workflow-engine-0.25.1.jar:0.25.1]
        at io.zeebe.util.sched.ActorJob.invoke( ~[zeebe-util-0.25.1.jar:0.25.1]
        at io.zeebe.util.sched.ActorJob.execute( [zeebe-util-0.25.1.jar:0.25.1]
        at io.zeebe.util.sched.ActorTask.execute( [zeebe-util-0.25.1.jar:0.25.1]
        at io.zeebe.util.sched.ActorThread.executeCurrentTask( [zeebe-util-0.25.1.jar:0.25.1]
        at io.zeebe.util.sched.ActorThread.doWork( [zeebe-util-0.25.1.jar:0.25.1]
        at [zeebe-util-0.25.1.jar:0.25.1]
    Caused by: org.rocksdb.RocksDBException: IOError(StaleFile)
        at org.rocksdb.Transaction.commit(Native Method) ~[rocksdbjni-6.13.3.jar:?]
        at org.rocksdb.Transaction.commit( ~[rocksdbjni-6.13.3.jar:?]
        at io.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.commitInternal( ~[zeebe-db-0.25.1.jar:0.25.1]
        at io.zeebe.db.impl.rocksdb.transaction.DefaultDbContext.runInNewTransaction( ~[zeebe-db-0.25.1.jar:0.25.1]
        at io.zeebe.db.impl.rocksdb.transaction.DefaultDbContext.runInTransaction( ~[zeebe-db-0.25.1.jar:0.25.1]
        ... 13 more

@MarceloEmmerich now it might be a permissions problem… it might be finding the right path but not be able to write to it.
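One way to check that from a shell inside the running task (e.g. via ECS Exec or a debug container) is a sketch like the following; the data path is the default from the official image, which is an assumption, so point `DATA_DIR` at your actual mount:

```shell
# Check whether the Zeebe data directory is mounted and writable.
# DATA_DIR defaults to the official image's data path -- an assumption;
# override it to match your task definition's mount point.
DATA_DIR="${DATA_DIR:-/usr/local/zeebe/data}"

df -h "$DATA_DIR" 2>/dev/null || echo "no filesystem mounted at $DATA_DIR"
ls -ld "$DATA_DIR" 2>/dev/null || echo "$DATA_DIR does not exist"

# Try an actual write, as the broker would:
if touch "$DATA_DIR/.writetest" 2>/dev/null; then
    rm "$DATA_DIR/.writetest"
    echo "data dir is writable"
else
    echo "data dir is NOT writable"
fi
```

`df` tells you whether the EFS volume is actually mounted at that path, and the `touch` test tells you whether the broker's user can write there at all.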

Thanks all, I eventually gave up and set up a cluster in EKS instead. Now everything works fine.


Hi Marcelo,

Could you please help me with the steps you followed to fix this issue? I am stuck on the same problem and have no clue what to do. Sharing the steps you followed would be a great help.

Hi, I could not solve the problem and gave up trying to run a cluster on Fargate. Instead, I set up a Kubernetes cluster on EKS and used the available Helm charts as a starting point to further customize my setup. This has been working perfectly so far.
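For anyone taking the same route, the rough shape of it is below; the chart repository URL and chart name are assumptions from the Zeebe 0.25 era, so check the current Zeebe/Camunda Helm documentation for the exact values:

```shell
# Add the Zeebe Helm repository and install the chart into the EKS
# cluster your kubectl context points at.
# Repo URL and chart name are assumptions -- verify against the docs.
helm repo add zeebe https://helm.zeebe.io
helm repo update

# Install a broker cluster with defaults; customize via --set flags
# or a values.yaml file.
helm install zeebe-cluster zeebe/zeebe-full
```

The chart's default `values.yaml` is the natural starting point for customization (broker count, partition count, storage class, etc.).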
