I despise Airflow and how cemented it is as data infrastructure. It's such a useful and basic concept, but it's a nightmare to manage and it works like junk. It's taken me 3 separate jobs over 7 years to realize that it's probably not our fault. Everyone seems to struggle with the same things: a flaky scheduler that is slow to run tasks, and confusing, redundant-sounding settings that apply at up to three different levels (environment, job, task). It invites less experienced users to write a sea of spaghetti code in a monolithic DAGs repo. People wind up doing heavy data munging in Python operators, which clobbers scalability and reliability. It also can't handle a large number of parallel tasks or frequent runs. It seems to scale miserably for the resources given, and it has bad controls for autoscaling. The UI feels dated and unintuitive. XComs seem useful to everyone but work like crap and are actually an anti-pattern.
I've also tried it on Cloud Composer (Google-managed), and automated upgrades always trashed the cluster. It's not well designed for GKE because it writes logs to files and requires StatefulSets. Testing the code is a huge burden due to the vast environment and dependencies needed to make it work locally.
I'm eager to rid my life of it and test out Temporal for some of the high concurrency/frequency cases we have.
The idea behind Airflow is great. What sucks is people using it to do heavy processing. Maybe with serverless/k8s, Airflow could fan out the processing to a cluster to allow for flexibility. But then, I guess you end up rewriting Spark et al.
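To make the "fan out instead of processing in the orchestrator" idea concrete, here is a rough sketch in plain Python, not Airflow code. Everything in it is made up for illustration: `heavy_transform`, `orchestrate` and `run_fan_out` are hypothetical names, and a `ThreadPoolExecutor` is only a stand-in for a real cluster (k8s jobs, serverless functions, Spark, whatever you have).

```python
from concurrent.futures import ThreadPoolExecutor


def heavy_transform(partition):
    """Stand-in for heavy work that belongs on a cluster, not in the orchestrator."""
    return sum(x * x for x in partition)


def orchestrate(partitions, submit):
    """The orchestrator only fans out work and collects small results.

    `submit` is any callable with the shape submit(fn, arg) -> future,
    so the backend doing the real work can be swapped out.
    """
    futures = [submit(heavy_transform, p) for p in partitions]
    return [f.result() for f in futures]


def run_fan_out(partitions, max_workers=4):
    # Assumption: a local thread pool plays the role of the remote cluster here.
    with ThreadPoolExecutor(max_workers=max_workers) as cluster:
        return orchestrate(partitions, cluster.submit)


if __name__ == "__main__":
    print(run_fan_out([[1, 2], [3]]))
```

The point is only the split of responsibilities: the orchestrating code never touches the data itself, it just hands out partitions and waits on small results.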
I have not used Celery + RabbitMQ, but I assume that combo is like Sidekiq + Redis, or any other job queue + worker system.
Airflow packages those things together and adds some additional features:
- UI with graph, Gantt, logs and other views of the workflow
- Users and permissions
- Places to store config
- Mechanisms for passing small data between tasks
- Various "sensors" for triggering workflows
- Various operators that interact with common data-oriented systems (BigQuery, Snowflake, S3, you name it). These are basically libraries that expose a config-forward API.
Probably the main selling point is the pre-made operators, but in short it is a complete solution with bells and whistles that aligns itself with the data ecosystem.
An analogy is "can you tell what problems Django solves that can't be handled by WSGI and psycopg?" Nothing fundamentally different, but life is a whole lot easier with Django. Honestly, if you're doing data engineering and you haven't spent time with a good DAG runner, you're doing yourself a real disservice.
My sibling comment did a good job explaining, but the UI + configurable storage + configurable triggers all out of the box make life a lot easier.
Django is easier when you want to do things only the "Django" way.
However, once you need something done differently, it quickly shows its truly rigid and brittle self, and you'll find yourself fighting a great and challenging battle.
Expressing the problem you are trying to solve as a DAG is idiomatic in Airflow, but expressing your problem in terms of queue processing is idiomatic in Celery.
a b c d vs. a (bc) d
They make different design decisions about what to surface via UX and what to make easy as a consequence of thinking of the problems in terms of different data structures.
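A toy illustration of the two framings, using only the Python standard library (neither Airflow nor Celery is involved, and `run_as_dag` / `run_as_queue` are hypothetical names). In the DAG version the dependencies are declared up front and the runner derives the order; in the queue version there is no graph at all, and a job simply enqueues whatever should happen next.

```python
from collections import deque
from graphlib import TopologicalSorter  # Python 3.9+


def run_as_dag(deps, tasks):
    """DAG framing: deps maps each task name to the set of names it depends on."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()  # run the task once everything it depends on has run
        order.append(name)
    return order


def run_as_queue(jobs):
    """Queue framing: each job is (name, fn); fn may return follow-up jobs."""
    queue = deque(jobs)
    done = []
    while queue:
        name, job = queue.popleft()
        follow_ups = job() or []
        done.append(name)
        queue.extend(follow_ups)  # the overall shape only emerges at runtime
    return done


if __name__ == "__main__":
    noop = lambda: None
    deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
    print(run_as_dag(deps, {n: noop for n in "abcd"}))
    print(run_as_queue([("a", lambda: [("b", noop)])]))
```

With the DAG the whole structure is known before anything runs, which is what makes graph and Gantt views cheap to offer. With the queue, the natural things to surface are queue depth, retries and worker state.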