I have a requirement to run event-driven ETL pipelines with complex dependencies. We realized we can't use something like Airflow, since its scheduling is primarily time-based rather than event-based.
We receive different types of source files into Google Cloud Storage buckets at different times. These files include some metadata that identifies the "batch number".
As shown in the diagram, IF0, IF1, IF2 and IF3 are these files. I am thinking of sending these "file put" events to Kafka (via Google Cloud Functions triggered on the GCS bucket) and using a workflow manager like Zeebe for task orchestration.
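To make the idea concrete, here is a rough sketch of the event-building logic I have in mind for the Cloud Function (names, topic, and object-path layout are illustrative assumptions, not a working deployment; the actual Kafka publish is omitted):

```python
# Sketch only: turn a GCS "object finalized" event into the Kafka record
# we would publish to a hypothetical "file-events" topic. The object path
# layout ("IF1/orders.csv") and the custom metadata key "batch_number"
# are assumptions for illustration.
import json

def build_file_event(gcs_event):
    """Build (topic, key, value) for a Kafka record from a GCS event dict."""
    batch = gcs_event["metadata"]["batch_number"]
    # Assume the interface name (IF0..IF3) is the first path segment.
    interface = gcs_event["name"].split("/")[0]
    # Key by batch number so all files of one batch land in one partition.
    value = json.dumps({
        "interface": interface,
        "bucket": gcs_event["bucket"],
        "object": gcs_event["name"],
        "batch_number": batch,
    })
    return "file-events", batch, value

topic, key, value = build_file_event({
    "bucket": "landing-bucket",
    "name": "IF1/orders.csv",
    "metadata": {"batch_number": "B42"},
})
```

The real function would hand `(topic, key, value)` to a Kafka producer; keying by batch number keeps all events of one batch together.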
Dependencies are as shown in the diagram: "Transformation 1" should be executed only when both IF1 and IF2 have arrived in the GCS bucket, etc.
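The core of what I need the orchestrator to express is this join condition. In Zeebe I believe this would be message correlation on the batch number; here is just a minimal sketch of the intended semantics (the names `REQUIRED` and `on_file_event` are my own, for illustration):

```python
# Sketch of the join condition for "Transformation 1": it should fire
# only once both IF1 and IF2 of the SAME batch have arrived.
REQUIRED = {"IF1", "IF2"}  # inputs of "Transformation 1"
arrived = {}               # batch_number -> set of interfaces seen so far

def on_file_event(batch_number, interface):
    """Record one arrival; return True when the batch is ready to transform."""
    seen = arrived.setdefault(batch_number, set())
    seen.add(interface)
    return REQUIRED <= seen  # True once all required interfaces are present
```

For example, `on_file_event("B42", "IF1")` returns `False`, and a later `on_file_event("B42", "IF2")` returns `True`, while events for other batches do not interfere.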
"Transform1", "Transform2", etc. can be long-running ETL jobs, such as Spark jobs or ETL queries on BigQuery.
Would Zeebe be a good fit for this use case?