Data Pipelines Using Apache Airflow

I previously wrote a blog and demo discussing how and why data engineers should deploy pipelines using containers. One slight disadvantage of deploying data pipeline containers is that managing, monitoring, and scheduling these activities can be a bit of a pain. One of the most popular tools for solving this is Apache Airflow, a platform to programmatically develop, schedule, and monitor workflows. Workflows are defined as code, making them easy to maintain, test, deploy, and collaborate on across a team.

At the core of Apache Airflow are workflows represented as Directed Acyclic Graphs (DAGs), written mainly in Python or as Bash commands. DAGs are made up of tasks that can be scheduled on a specific cadence and monitored using the built-in Airflow webserver's interface.
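To give a sense of what that looks like in code, here is a minimal sketch of a DAG with two tasks and one dependency, assuming Airflow 2.x imports (the DAG id, schedule, and task names are placeholders for illustration, not taken from my demo):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # Placeholder task body; a real task would do the actual work.
    print("transforming data")


with DAG(
    dag_id="example_dag",            # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",      # the cadence the tasks run on
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data'",
    )
    transform_and_load = PythonOperator(
        task_id="transform_and_load",
        python_callable=transform,
    )

    # The >> operator defines the directed edges of the graph.
    extract >> transform_and_load
```

Dropping a file like this into your DAGs folder is all it takes for the scheduler to pick it up and for the workflow to show up in the webserver.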

Generally, I recommend two methods of using Airflow for monitoring and scheduling purposes with containers in Azure.

  1. DAG deployment
  2. RESTful web service deployment

Developing your data pipelines as DAGs makes it easy to deploy and set a schedule for your jobs. Engineers write a data pipeline Python script to extract, transform, or move data, plus a second script that imports the data pipeline into a DAG to be run on a specific cadence, as sketched below. An example of this is the hello world example I have provided. While the development and integration of data pipelines in Azure is easier when they are created as DAGs, it requires the developer to deploy all of their pipelines to the same Azure Container Instance or Kubernetes cluster.
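As a rough sketch of that two-script pattern (the module and function names here are hypothetical, not the ones from my hello world example), the pipeline logic stays in plain Python and the DAG script simply wraps it in a task:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# In a real project this function would live in its own module and be imported,
# e.g. `from my_pipeline import run_pipeline` (hypothetical name), keeping the
# pipeline code free of any Airflow dependency.
def run_pipeline():
    records = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]   # extract (stubbed)
    total = sum(r["value"] for r in records)                     # transform
    print(f"Loaded {len(records)} records, total value {total}")  # load (stubbed)


with DAG(
    dag_id="my_data_pipeline",       # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",      # the cadence the pipeline runs on
    catchup=False,
) as dag:
    run_task = PythonOperator(
        task_id="run_pipeline",
        python_callable=run_pipeline,
    )
```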

Deploying data pipelines as RESTful web services allows developers to decouple scheduling from the data pipeline by deploying a web service separate from your Apache Airflow deployment. With separate deployments, a developer simply writes a DAG that calls the web service on whatever schedule you wish. This is also a great way to offload compute and memory from your Airflow server. The one drawback is that it adds a little more work to handle web service secrets, but once that is handled it is easy to repeat and reuse across all your data pipelines. An example of this can be found in my RESTful deployment example, and a sketch of the scheduling DAG follows below. While the Azure Machine Learning service is geared toward deploying machine learning models as web services, it can be used to deploy data pipelines as well, allowing the developer to offload the authentication and security required when developing a web service.
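A minimal sketch of the scheduling side of that pattern might look like the following, assuming a hypothetical endpoint URL and API key stored as Airflow Variables (the variable names, payload, and bearer-token auth are illustrative, not taken from my RESTful deployment example):

```python
from datetime import datetime

import requests

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def call_pipeline_service():
    # Hypothetical endpoint and secret; in practice these come from your
    # deployed web service (for example, an Azure Machine Learning endpoint).
    endpoint = Variable.get("pipeline_service_url")
    api_key = Variable.get("pipeline_service_key")

    response = requests.post(
        endpoint,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"run_date": datetime.utcnow().isoformat()},  # illustrative payload
        timeout=300,
    )
    response.raise_for_status()  # fail the task if the service returns an error
    print(response.json())


with DAG(
    dag_id="call_restful_pipeline",  # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    trigger = PythonOperator(
        task_id="call_pipeline_service",
        python_callable=call_pipeline_service,
    )
```

Because the heavy lifting happens in the web service, the Airflow worker only needs enough resources to make the HTTP call and wait for the response.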

Overall, I have seen organizations develop home-grown scheduling and monitoring techniques in order to capture all the metadata required to ensure their data pipelines are running properly. Apache Airflow makes this process easy by offering a great built-in user interface to visualize your data pipelines, and it provides a metadata database developers can use to build additional reporting as needed.

Check out the demo I created walking engineers through the development and deployment of data pipelines in Azure using Apache Airflow!
