@philipp.ossler Thank you for the welcome!
@philipp.ossler / @arjunkumar09 I’ll outline the architecture as a first pass and see where it takes us, if that’s OK?
We run a number of services on AWS Fargate, all within the same cluster; Zeebe is one of those services. At present - and I’m sure this will be the primary target for scrutiny/advice - we’re running Zeebe in a stand-alone configuration, i.e. broker and gateway in a single container.
To give Zeebe a persistent filesystem across upgrades etc., the container mounts an EFS volume. Ironically, perhaps, this is where the problems tend to present themselves. If we’re running Zeebe from a clean slate - empty EFS volume, no indexes in ES, etc. - it’s perfectly fine. We can even redeploy a number of times - as yet unmeasured - before we encounter this problem.
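For context, the EFS mount in the task definition looks roughly like the excerpt below. This is an illustrative sketch, not our actual config: the `fileSystemId` is a placeholder, the image tag is an assumption, and `/usr/local/zeebe/data` is the data directory of the official Zeebe Docker image.

```json
{
  "containerDefinitions": [
    {
      "name": "zeebe",
      "image": "camunda/zeebe:8.3.0",
      "mountPoints": [
        {
          "sourceVolume": "zeebe-data",
          "containerPath": "/usr/local/zeebe/data"
        }
      ]
    }
  ],
  "volumes": [
    {
      "name": "zeebe-data",
      "efsVolumeConfiguration": {
        "fileSystemId": "fs-12345678",
        "transitEncryption": "ENABLED"
      }
    }
  ]
}
```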
To get things back up and running, we manually “clean the slate”: wipe the EFS volume and delete the ES indexes. When Zeebe restarts, all is well.
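The manual reset amounts to roughly the following. The paths, the ES URL, and the `zeebe-record*` index pattern are illustrative assumptions (not taken from our actual environment), and the sketch only prints the commands it would run rather than executing them:

```shell
#!/bin/sh
# Hypothetical "clean slate" helper - a dry-run sketch that echoes
# the commands instead of running them. All values are assumptions.
ZEEBE_DATA_DIR="${ZEEBE_DATA_DIR:-/usr/local/zeebe/data}"  # EFS mount point (assumed)
ES_URL="${ES_URL:-http://elasticsearch:9200}"              # exporter target (assumed)

clean_slate() {
  # 1. Wipe the broker's working directory on the EFS volume.
  echo "rm -rf ${ZEEBE_DATA_DIR:?}/*"
  # 2. Drop the Zeebe exporter indexes in Elasticsearch.
  echo "curl -s -X DELETE ${ES_URL}/zeebe-record*"
}

clean_slate
```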
The only contributing factor I can propose is the behaviour of the EFS mount during the shutdown process, perhaps?