Zeebe for event-driven ETL pipelines?

Greetings,

I have a requirement to run event-driven ETL pipelines with complex dependencies. We realized that we can't use something like Airflow, since its triggers are mostly time-based.

We receive different types of source files into Google Storage buckets at different times. These files include some metadata that identifies the “batch number”.
As shown in the diagram, IF0, IF1, IF2, and IF3 are these files. I am thinking of sending these “file put” events to Kafka (via Google Cloud Functions) and using a workflow manager like Zeebe for task orchestration.
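For illustration, the Cloud Function side could publish something like this to Kafka. A minimal sketch assuming a plain Kafka producer; the topic name `file-events` and the event fields are placeholders of mine:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FilePutEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One "file put" event per object landing in the GS bucket.
            // Keying by batch number keeps all events of one batch on one partition.
            String batchNumber = "batch-42"; // read from the file's metadata
            String event = "{\"interface\":\"IF1\","
                         + "\"file\":\"gs://my-bucket/if1/batch-42.csv\","
                         + "\"batchNumber\":\"" + batchNumber + "\"}";
            producer.send(new ProducerRecord<>("file-events", batchNumber, event));
        }
    }
}
```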

Dependencies are as shown in the diagram: “Transformation 1” should be executed only when both the IF1 and IF2 files have arrived in the GS bucket, etc.

“Transform1”, “Transform2”, etc. can be long-running ETL jobs, such as Spark jobs or ETL queries on BigQuery.
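To make the question concrete: I imagine each transform as a BPMN service task handled by a job worker that kicks off the ETL job. A rough sketch with the Zeebe Java client (the job type `transform-1` and the `runSparkJob` helper are placeholders of mine; package and builder method names vary by client version):

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;
import java.time.Duration;

public class TransformWorker {
    public static void main(String[] args) {
        ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("localhost:26500")
                .usePlaintext()
                .build();

        // Subscribes to service tasks of type "transform-1" in the model.
        JobWorker worker = client.newWorker()
                .jobType("transform-1")
                .handler((jobClient, job) -> {
                    runSparkJob(job.getVariablesAsMap()); // placeholder for the actual ETL job
                    jobClient.newCompleteCommand(job.getKey()).send().join();
                })
                .timeout(Duration.ofHours(2)) // long-running ETL: keep the job locked long enough
                .open();

        // The worker polls in the background until closed.
        Runtime.getRuntime().addShutdownHook(new Thread(worker::close));
    }

    private static void runSparkJob(java.util.Map<String, Object> variables) {
        // submit the Spark job / BigQuery query here
    }
}
```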

Would Zeebe be a good fit for this use case?


Yes, you can use Zeebe to do this. You would probably model the file arrivals as message catch events.

To see if it is a good fit, I recommend modelling it in BPMN with the Zeebe Modeler.

The key issue is that a stateful process needs a correlation key: there must be some unique ID shared by the events that belong to the same process, so that they can be correlated to the same process instance. The batch number in your file metadata sounds like a natural candidate.
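For example, whatever bridges the Kafka events into Zeebe would publish each file arrival as a message, with the batch number as the correlation key. A minimal sketch with the Zeebe Java client (the message name `IF1-arrived` is a placeholder; it must match a message catch event in your model):

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.Map;

public class FileArrivalPublisher {
    public static void main(String[] args) {
        try (ZeebeClient client = ZeebeClient.newClientBuilder()
                .gatewayAddress("localhost:26500")
                .usePlaintext()
                .build()) {

            // Correlates with a message catch event named "IF1-arrived" in the model.
            // All events of one batch share the batch number, so they reach the same instance.
            client.newPublishMessageCommand()
                  .messageName("IF1-arrived")
                  .correlationKey("batch-42")
                  .variables(Map.of("file", "gs://my-bucket/if1/batch-42.csv"))
                  .send()
                  .join();
        }
    }
}
```

In the model, “Transformation 1” would then sit behind the IF1 and IF2 message catch events (joined by a parallel gateway), so a given batch's instance only proceeds once both messages with that batch number have been correlated.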

Josh


Thank you for the reply.

I assume the BPMN workflows are transformed into Kafka Streams applications internally by Zeebe?

If so, for joins (e.g. joining IF1 and IF2 in my diagram), does it perform joins between KStreams?
Would we then need to specify a time window for this kind of join between two unbounded KStreams?

No: Zeebe does not compile workflows into Kafka Streams applications. The engine maintains its own state and event log independently of Kafka, so there is no KStream join and no join window to configure. To get events from Kafka into Zeebe, look at the Zeebe Kafka connector; it is an integration external to the engine’s core functionality.
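Conceptually, such a bridge is just a consume-and-publish loop. A minimal sketch with a plain Kafka consumer, reusing the `file-events` topic and keyed-by-batch-number records assumed earlier (the actual connector is configuration-driven; check its docs rather than this):

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToZeebeBridge {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "zeebe-bridge");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             ZeebeClient zeebe = ZeebeClient.newClientBuilder()
                     .gatewayAddress("localhost:26500")
                     .usePlaintext()
                     .build()) {

            consumer.subscribe(List.of("file-events"));
            while (true) { // run until the process is killed; a sketch, not production code
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The record key is the batch number (see the producer sketch above),
                    // so it doubles as the correlation key. Deriving the message name from
                    // the interface id (IF1, IF2, ...) is omitted for brevity.
                    zeebe.newPublishMessageCommand()
                         .messageName("file-arrived")
                         .correlationKey(record.key())
                         .variables(record.value()) // the JSON event string
                         .send()
                         .join();
                }
            }
        }
    }
}
```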

(In)validate all of your assumptions.
