Add queries for Zeebe DB data to the Zeebe API

The Zeebe API only exposes a few “write” queries for now (0.23.1) and, surprisingly, no read-only queries (except for topology).

Playing with Zeebe and the Elasticsearch exporter, I have always kept the Zeebe data but deleted the Elasticsearch data many times. This leads to an inconsistent state between the Zeebe and Elasticsearch data.

Once the sync is broken, since the API does not provide information about workflows, instances or jobs, Zeebe is a black box, and there is no way to resync the exported data from the Zeebe data.

My goals are to:

  • Debug workflow errors with Zeebe data, not exported data.
  • Easily re-export Zeebe data to exporters.
  • Analyze the content of the Zeebe DB black box through the API.

After all, Zeebe’s behavior depends on the Zeebe DB data, which is consequently first-class data in the Zeebe ecosystem.

Would it be relevant to add information queries about workflows/variables/instances/jobs from the Zeebe DB to the Zeebe API?


Maybe, but that is a big architectural change.

You could search the GitHub issues - there is at least one in there for deployed workflows - and add your use case to it, or open a new one.

The development is driven by community input, and GitHub Issues are the best place to have that discussion.

If you do that, please drop a link back in here so others can also head over and contribute.

Does just adding read-only queries to the API require changing the architecture? I must have missed something here… :blush:

Thanks! Glad to hear that! But I just want to discuss my thoughts here before opening any GitHub issues. I want to be sure that my needs can’t be fulfilled without new development.

I am preparing a Proof of Concept project around Zeebe, so I need to know a lot about Zeebe’s architecture and philosophy before validating the concept: what works well and what does not, what is possible and what is not.

I have already explained a problem I am facing right now, where I deleted the exported data in Elasticsearch while treating the Zeebe data as precious.

I was thinking that deleting the Elasticsearch data would “reset” Zeebe (I read this silly advice in GitHub issues, but maybe I misread), but no: Zeebe stores more than cluster information on disk, it also has its RocksDB filled with event logs, allowing it to track the state of workflows/instances/etc.

If the Zeebe DB is the critical core data storing the state of all my micro-service orchestration, you can understand that I need complete access to it. Events sent to exporters are not relevant to me, as they are only a copy of the data and can be corrupted for whatever reason, or simply deleted on a rolling basis to avoid flooding.

From a backup point of view, I need to know where the critical data is: the data that allows me to recover and restart the system from exactly the state it was in before an incident.

I cannot sell a service to a customer where the critical data is a black box that I cannot analyze/debug/save/restore/recover/repair.

[EDIT]
I have read the whole Basics chapter (at last!) and have a better understanding of the internals of Zeebe.

To achieve a resync (getting Zeebe to export its current state to the exporters), maybe we can use the snapshot functionality of Zeebe.

Extract from the blog:

As the snapshots are used during recovery, it’s imperative that they are made durable, and are correctly replicated. The team therefore introduced checksum validation during snapshot replication, and improved the durability of snapshots on creation as well as during replication.

I see two possibilities here:

  • Send an API command to fetch, or (better) push to the exporters, all the logs of a Zeebe partition.
  • Send an API command to fetch, or push to the exporters, the snapshot of a Zeebe partition.

Doing that, we respect the Zeebe philosophy:
Send a command via the API → store the command event → execute the command → send a response to the API and/or export data to the exporters.
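That pipeline can be written down as a minimal sketch (hypothetical record shapes, not Zeebe’s actual protocol): the command is appended to the log first, processing turns it into a follow-up event, and that event is both returned to the client and handed to the exporters.

```python
log = []        # append-only record stream (commands + events)
exported = []   # what an exporter would receive

def handle_command(command):
    log.append(command)                       # 1. store the command event
    event = {**command, "intent": "CREATED"}  # 2. execute the command
    log.append(event)
    exported.append(event)                    # 3. export the result event...
    return event                              # ...and answer the client

response = handle_command({"type": "DEPLOYMENT", "intent": "CREATE"})
```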

Do you think this is realistic?


Yes, because the state is distributed. When you write to the cluster, the request is either round-robin routed, or contains a partition key (like a message targeting a specific workflow instance) that routes the API command to a single broker.

As soon as you read, you are now collecting data from the entire cluster. The gateway is stateless.
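A toy model of that routing difference (hypothetical hash, not Zeebe’s actual partitioning scheme): a write with a key lands deterministically on one partition, while a global read has no single key and would have to touch every partition.

```python
import hashlib

PARTITIONS = 3

def route(correlation_key=None, _rr=[0]):
    """Pick a partition: hash a correlation key if given, else round-robin."""
    if correlation_key is not None:
        digest = hashlib.md5(correlation_key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % PARTITIONS
    _rr[0] = (_rr[0] + 1) % PARTITIONS
    return _rr[0]
```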

Again, this is non-trivial to implement, as I understand it.

The events are exported at the moment that they occur. The entire system is event-sourced and the state is a reduction or projection of the event stream.

At the moment the state representation in Operate is a projection from the event stream in ES.
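That reduction can be illustrated with a tiny sketch (hypothetical record fields, not Zeebe’s actual schema): folding the event stream into a map of the last known intent per instance.

```python
from functools import reduce

events = [
    {"pos": 1, "instance": 7, "intent": "ELEMENT_ACTIVATED"},
    {"pos": 2, "instance": 7, "intent": "ELEMENT_COMPLETED"},
]

def apply(state, event):
    state[event["instance"]] = event["intent"]  # keep the last intent seen
    return state

state = reduce(apply, events, {})
```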

If you want Zeebe to collect the distributed projection from all the brokers and export that, that’s probably the easiest way to do it, so you would have an API command that queries the brokers in the topology to forward their projections.

However, you have several technical issues:

  1. You now have a unique novel architectural pattern, multiple brokers responding to a single command - there is nothing that does that right now. At the moment it is client → gateway → specific broker → gateway → client. Now you introduce an entirely new class of “query” that needs to be implemented and debugged.

  2. The projections are already out of date, because right behind your queued query are state-mutating commands. You would have to hydrate your state representation with the returned projection and fuse it with exported events immediately, so your external state projection is also complicated.
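Problem 2 can be sketched like this, in a hypothetical model where the returned projection carries the log position it was taken at, and exported events newer than that position must be replayed on top of it:

```python
snapshot = {"position": 2, "state": {7: "ELEMENT_COMPLETED"}}
exported = [
    {"pos": 2, "instance": 7, "intent": "ELEMENT_COMPLETED"},  # covered by snapshot
    {"pos": 3, "instance": 8, "intent": "ELEMENT_ACTIVATED"},  # arrived after it
]

state = dict(snapshot["state"])
for event in exported:
    if event["pos"] > snapshot["position"]:  # skip what the snapshot covers
        state[event["instance"]] = event["intent"]
```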


Ok, I understand the whole concept better.

Yesterday, I managed to easily fix my use case.

The problem:

  • ES data is deleted to forget old workflows.
  • In the modeler, old workflows are overwritten with new IDs.
  • A new workflow is deployed.
  • A message is published, which triggers a new instance of the new workflow.
  • Problem: an old workflow with the same Start Event Message Name is triggered too!

That’s why I was thinking of resyncing the “state” of Zeebe with the exported “state” in ES. But it is not a good solution to this particular problem.

The solution:

  • In the modeler, I created a new workflow with the old ID, with an empty and disconnected Start Event.
  • I deployed this new version of the old workflow.
  • I published the same message, and now only the new workflow creates an instance (thanks to the disconnected Start Event in the new version of the old workflow).
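The behavior can be modeled with a small sketch (hypothetical names): publishing a message starts an instance for every deployed process whose latest version has a matching message start event, which is why deploying a new version of the old workflow without the start event stops it from triggering.

```python
deployed = {}  # process_id -> list of versions; each version = set of start message names

def deploy(process_id, start_messages):
    deployed.setdefault(process_id, []).append(set(start_messages))

def publish(message_name):
    """Return (process_id, latest_version) for every process the message starts."""
    return [
        (process_id, len(versions))
        for process_id, versions in deployed.items()
        if message_name in versions[-1]  # only the latest version counts
    ]

deploy("old-workflow", {"order-received"})
deploy("new-workflow", {"order-received"})
before = publish("order-received")   # both workflows start an instance!

deploy("old-workflow", set())        # v2 of the old workflow: start event removed
after = publish("order-received")    # only the new workflow starts one
```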

So, my advice here is to follow the philosophy of an append-only stream of events when using Zeebe and ES.

Thank you very much for your explanation.


@jwulf Similar problems.
Zeebe does provide a way to query workflow/instance/variable status from Elasticsearch, but the biggest problem for me is the latency.

  • I read the Elasticsearch exporter code and found that records are buffered for a bulk delay (default: 5s) before being flushed to ES.
  • Worse, records can’t be queried until the ES refresh_interval has elapsed (default: 30s).

Thus, a lot of time passes before we can query the real current status, which is unacceptable in our production case: we have to report the current status of instances and instance variables with millisecond-level latency to some services (not Zeebe workers).

We can reduce the bulk delay and refresh_interval to smaller values, but that is far from our goal.
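For reference, here is where those two knobs live — a sketch assuming Zeebe 0.23’s YAML broker configuration and the default `zeebe-record` index prefix; the values are illustrative, not recommendations:

```yaml
# Zeebe broker configuration - Elasticsearch exporter flush settings
zeebe:
  broker:
    exporters:
      elasticsearch:
        className: io.zeebe.exporter.ElasticsearchExporter
        args:
          url: http://localhost:9200
          bulk:
            delay: 1    # flush at least every second (default: 5)
            size: 100   # ...or after this many records (default: 1000)
```

The index-side latency can then be lowered with an Elasticsearch settings call, e.g. `PUT /zeebe-record*/_settings` with body `{"index": {"refresh_interval": "1s"}}`. As noted above, though, this only shrinks the latency; it does not remove it.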

I wonder if it’s possible to export these events (aggregated status would be even better) to some in-memory storage and provide a query API on top of it. That way, we wouldn’t have to change the Zeebe core architecture.

Discussion and suggestions appreciated.

There is already the Hazelcast exporter that you can test for your use case. It is an in-memory DB that is used by Zeebe simple-monitor.

But you can also develop your own exporter (they are implemented as plugins) if you already have a preferred in-memory DB.

Hope it helps.

You may also be interested in the ZeeQS project: https://github.com/zeebe-io/zeeqs
It imports the data via Hazelcast, aggregates it and provides a GraphQL API to query it.
