Exploring Apache Airflow: Pros and Cons in Data Engineering
Should You Use Apache Airflow?
Why Data Engineers Love/Hate Airflow
Data pipelines are essential to the data infrastructure of any organization. A widely used framework for managing data extraction and transformation is Apache Airflow. Companies may rely entirely on Airflow and its various operators or use it to orchestrate other tools like Airbyte and dbt.
Airflow originated at Airbnb in 2014 to address the growing need for complex data pipelines. Released as open source, it quickly spread beyond Airbnb because it met many of the requirements data engineers had at the time.
However, nearly ten years later, some of its shortcomings have become apparent, leading to the emergence of new frameworks in the Python data pipeline landscape, such as Prefect and Dagster.
Despite the quirks and limitations of Airflow, which often only become clear when attempting to scale it in a demanding data environment, many teams still depend on it. In this article, I aim to explore the reasons why data engineers have mixed feelings about Airflow by interviewing several professionals about their experiences.
Where We Started
In my conversations with various data engineers regarding their initial data pipeline solutions, I expected to find some common ground. I thought many would have started with systems like SSIS, cron, or bash scripts.
Surprisingly, responses varied widely. Some began with Airflow, others with SSIS, and some even with lesser-known data integration tools. Here are a few insights:
- Joseph Machado — Senior Data Engineer @ LinkedIn: “I started running Python scripts locally.”
- Sarah Krasnik — Founder of Versionable: “I’ve actually only ever used Airflow.”
- Matthew Weingarten — Senior Data Engineer @ Disney Streaming: “At my first company, we used TIBCO as our enterprise data pipeline tool.”
- Mehdi (mehdio) Ouazza — @mehdio DataTV: “Airflow was the first tool I used running an on-prem system.”
With countless data pipeline solutions available today, it has always been a challenge to determine the best method for transporting data from one point to another, whether through custom solutions or no-code options.
Why We Like Airflow
So what makes Airflow so appealing in the data engineering realm? Quoting Maxime Beauchemin, the creator of Airflow, from a 2015 article:
> “As a result of using Airflow, the productivity and enthusiasm of people working with data has been multiplied at Airbnb.”
Airflow emerged as a productivity booster during a time when data engineering was inundated with one-off requests and constant migrations. Given that Airbnb developed it for their own needs, it’s worth examining why other data engineers have adopted it as well.
Easy To Start
One of Airflow's strengths is how straightforward it is to create your first Directed Acyclic Graph (DAG). You simply need to write a few parameterized operators and then run:

```bash
airflow standalone
```
In no time, you have Airflow up and running, either locally or on an EC2 instance. While this setup works fine for the initial months, it’s important to consider scaling and potential storage issues later on.
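As an illustration, here is a minimal sketch of such a DAG against the Airflow 2.x API; the DAG id, task id, and command are all invented for the example:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal DAG: one task that echoes a greeting once a day.
with DAG(
    dag_id="my_first_dag",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",              # `schedule_interval` on Airflow < 2.4
    catchup=False,                  # skip backfilling runs before today
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Airflow'",
    )
```

Drop a file like this into the `dags/` folder and `airflow standalone` will pick it up.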
Scheduling
Airflow features a user-friendly scheduler. Using cron-based scheduling, developers can set their DAGs to run daily, hourly, weekly, or anything in between. Because the schedule lives alongside the pipeline code, there is no need to edit crontabs separately.
This is particularly welcome for anyone who has wrestled with SQL Server Agent in the past. With Airflow, the scheduling component is seamlessly integrated.
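As a sketch of the options (DAG ids are invented; Airflow versions before 2.4 call the argument `schedule_interval`):

```python
from datetime import datetime, timedelta

from airflow import DAG

# Cron presets, raw cron expressions, and timedeltas are all accepted.
daily_report = DAG(
    dag_id="daily_report",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",               # preset: midnight every day
    catchup=False,
)

nightly_load = DAG(
    dag_id="nightly_load",
    start_date=datetime(2023, 1, 1),
    schedule="15 2 * * *",           # cron expression: 02:15 every day
    catchup=False,
)

half_hourly_sync = DAG(
    dag_id="half_hourly_sync",
    start_date=datetime(2023, 1, 1),
    schedule=timedelta(minutes=30),  # fixed interval between runs
    catchup=False,
)
```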
Dependency Management
Implementing dependency management in a custom data pipeline can be complex. In the past, some data engineers might have relied on setting intervals between tasks, but that approach doesn’t truly manage dependencies—if one task fails, the next may still proceed.
For me, this was a key reason for adopting Airflow. It allows for a clear depiction of task dependencies, ensuring that dependent tasks do not execute if a prior task fails. This is represented in Airflow's DAG paradigm, which visually outlines task relationships. The UI facilitates tracking and rerunning tasks at failure points without needing to restart the entire pipeline, explaining Airflow's rapid rise in popularity.
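A small sketch of what this looks like in code (DAG and task names are made up). The bitshift operators declare the dependency graph, and a failed upstream task blocks everything downstream by default:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_example",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # If `extract` fails, neither `transform` nor `load` runs,
    # and the UI shows exactly where the chain stopped.
    extract >> transform >> load
```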
Other Reasons Developers Choose Airflow
Numerous additional factors attract developers to Airflow, including:
- SQL Templating (a brief sketch follows this list)
- Strong Open Source Community
- A Good Balance Between Predefined Components and Custom Code
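To illustrate the first point: operators expose templated fields that Jinja renders at runtime. A hedged sketch, assuming the `apache-airflow-providers-postgres` package is installed and a connection named `analytics_db` exists:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="sales_templating_demo",   # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # `sql` is a templated field: Jinja renders {{ ds }} to the run's
    # logical date, so each run processes exactly its own date slice.
    load_daily_sales = PostgresOperator(
        task_id="load_daily_sales",
        postgres_conn_id="analytics_db",   # assumed connection id
        sql="""
            INSERT INTO daily_sales
            SELECT * FROM raw_sales
            WHERE sale_date = '{{ ds }}';
        """,
    )
```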
Downsides
Despite its popularity, Airflow has notable downsides, especially when trying to manage it independently in production.
Scaling Challenges
Scaling Airflow can be quite difficult, as many interviewees acknowledged. Often, the responsibility for managing the infrastructure falls to DevOps or Data Infrastructure teams, as ensuring Airflow's smooth operation becomes a dedicated task.
Similar to Hadoop, Airflow requires several auxiliary services to run well, especially at scale: a scheduler, a webserver, a metadata database, and usually a distributed executor with its own workers. Those interested in scaling Airflow should consult its architecture documentation and related resources.
While Airflow's ease of creating basic DAGs may draw users in, neglecting the need for scalable architecture as job demands grow can lead to significant issues down the line.
Clunky Data Passing Between Tasks
When discussing limitations, Sarah Krasnik highlighted:
> “Passing information between tasks... The concept of XComs is just so clunky, buggy, and really hard to get right.”
I experienced similar frustrations with data passing while using Dataswarm, Airflow's predecessor: I often had to write data to a text file in one task and read it back in the next, which felt cumbersome.
Airflow does offer XComs for data transfer, but, as Sarah noted, it can be challenging to implement effectively. While this may be partly by design, the need for seamless data transfer is a recurring hurdle for many users.
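To make the pain point concrete, here is a minimal sketch of the XCom push/pull pattern (DAG and task names are invented). The producer's return value is serialized into the metadata database, and the consumer must name the producer's `task_id` to pull it back out:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def produce():
    # The return value is implicitly pushed to XCom (key "return_value")
    # and must be serializable into the metadata database.
    return {"row_count": 42}

def consume(ti):
    # The consumer has to reference the upstream task explicitly.
    payload = ti.xcom_pull(task_ids="produce")
    print(payload["row_count"])

with DAG(
    dag_id="xcom_demo",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    producer = PythonOperator(task_id="produce", python_callable=produce)
    consumer = PythonOperator(task_id="consume", python_callable=consume)
    producer >> consumer
```

Even in this toy example, the handoff goes through the metadata database rather than directly between processes, which is where the clunkiness Sarah describes begins.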
In summary, while Airflow has numerous strengths, it is far from flawless, and many users have learned to navigate its limitations.
Other Options
During my interviews, many engineers mentioned Prefect and Dagster as alternatives. Both frameworks aim to address Airflow's shortcomings, each taking its own approach to the problems outlined above.
If you're considering alternatives, these two options may be worth exploring.
Airflow 2.X
Airflow continues to be a leading framework for organizations looking to scale their data pipeline infrastructure. Its user-friendly nature and robust community support suggest that it will remain influential in the data engineering field, particularly with managed solutions and improved best practices.
Is Airflow the right choice for you? Ultimately, it depends on factors like your team's size, skills, and budget.
What has been your experience? Have you utilized Airflow, or do you prefer a different solution?