I despise airflow and how cemented it is as data infrastructure. It's such a usefu...

bushbaba · on Aug 2, 2022

The idea behind airflow is great. What sucks is people using it to do heavy processing. Maybe with serverless/k8s, airflow could fan out the processing to a cluster to allow for flexibility. But then, I guess you end up rewriting spark et al.

longcommonname · on Aug 3, 2022

Try gcsfuse to write logs to a bucket.

It's the one thing I like about our airflow. Everything else you said is echoed.

Also, the toil of dealing with many airflow instances when you have engineers who don't want to automate it.

zibarn · on Aug 2, 2022

Not experienced here, but out of genuine interest: can you tell me what problems airflow solves that can't be handled by celery and rabbitmq?

Hippocrates · on Aug 2, 2022

I have not used celery + rabbitmq but I assume that combo is like sidekiq + redis, or any other job queue + worker system.

Airflow packages those things together and adds some additional features:

- UI with graph, gantt, logs and other views of the workflow
- Users and permissions
- Places to store config
- Mechanisms for passing small data between tasks
- Various "sensors" for triggering workflows
- Various operators that interact with common data-oriented systems (bigquery, snowflake, s3, you name it). These are basically libraries that expose a config-forward API.

Probably the main selling point is the pre-made operators, but in short it is a complete solution with bells and whistles that aligns itself with the data ecosystem.

code_biologist · on Aug 2, 2022

An analogy is "can you tell what problems Django solves that can't be handled by wsgi and psycopg?" Nothing fundamentally different, but life is a whole lot easier with Django. Honestly if you're doing data engineering and you haven't spent time with a good DAG runner, you're doing yourself a real disservice.

My sibling comment did a good job explaining, but the UI + configurable storage + configurable triggers all out of the box make life a lot easier.

metadat · on Aug 3, 2022

Django is easier when you want to do things only the "Django" way. However once you need something done differently it quickly shows its truly rigid and brittle self, and you'll find yourself fighting a great and challenging battle.

shrikant · on Aug 3, 2022

Perhaps unwittingly, you've hit upon people's exact frustrations with Airflow! :)

JoshCole · on Aug 3, 2022

Expressing the problem you are trying to solve as a DAG is idiomatic in Airflow, but expressing your problem in terms of queue processing is idiomatic in Celery.

    a b c d vs. a (bc) d

They make different design decisions about what to surface via UX and what to make easy as a consequence of thinking of the problems in terms of different data structures.
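To make the data-structure difference concrete, here is a minimal sketch in plain Python. It uses neither Airflow nor Celery; the task names and both helper functions are hypothetical and only illustrate the two mental models. The DAG version declares what each task waits on and lets a topological sort work out what may run together. The queue version just consumes jobs in arrival order.

```python
from collections import deque
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key maps to the set of tasks it must wait for.
dag = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
}


def run_as_dag(dependencies):
    """Return batches of tasks; tasks in the same batch have no mutual dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Everything whose upstream tasks are finished is ready at once.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def run_as_queue(jobs):
    """Return jobs in the order a single FIFO worker would process them."""
    pending = deque(jobs)
    processed = []
    while pending:
        # The worker knows nothing about dependencies, only arrival order.
        processed.append(pending.popleft())
    return processed


dag_batches = run_as_dag(dag)
queue_order = run_as_queue(["extract", "clean", "enrich", "load"])
```

In the DAG version, "clean" and "enrich" end up in the same batch because the structure itself says they are independent. In the queue version that information has to live somewhere else, such as the order in which jobs are enqueued or logic inside the jobs.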

correlator · on Aug 2, 2022

Airflow with a celery backend is a pretty sweet combination. In that instance, airflow just gives you a nice scheduler to manage all the celery jobs.