Airflow 101: An Introduction to the Workflow Management System


What is Airflow?

Apache Airflow is an open-source tool for developing, scheduling, and monitoring batch workflows. First developed at Airbnb in October 2014 to manage its increasingly complex workflows, it is now a project of the Apache Software Foundation and one of the most popular tools data engineers use to orchestrate workflows. Airflow is written in Python, and its workflows are defined in Python code as well. A user-friendly web interface makes it easy to monitor and troubleshoot workflows.

When to use and when not to use Airflow?

Airflow is a tool for orchestrating and scheduling batch workflows. Numerous built-in operators and sensors make it easy to extend and quick to get started with. Common applications include report scheduling, DevOps tasks, and general automation. Airflow is not a streaming solution and was not designed to run event-based processes indefinitely.

Airflow is built around the idea of workflow as code, so writing code is a prerequisite for getting hands-on. If you prefer to build pipelines entirely through a graphical interface, Airflow is not the right solution for you.

Airflow’s workflow-as-code approach has several advantages (see the sketch after this list):

  • Allows for dynamic pipeline generation.
  • Workflow parameterization is built in, leveraging the Jinja templating engine.
  • It becomes easier for teams to collaborate on workflow design and implementation.
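
To make this concrete, here is a minimal sketch of workflow as code; the DAG id, task ids, and shell commands are invented for illustration. The bash_command uses the built-in Jinja template variable {{ ds }}, and an ordinary Python loop generates tasks dynamically:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Illustrative DAG: the dag_id, task_ids, and commands are made up.
    with DAG(
        dag_id="example_reports",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Built-in Jinja templating: {{ ds }} renders to the logical date at runtime.
        extract = BashOperator(
            task_id="extract",
            bash_command="echo 'extracting data for {{ ds }}'",
        )

        # Dynamic pipeline generation: plain Python creates one task per region.
        for region in ["us", "eu", "apac"]:
            report = BashOperator(
                task_id=f"report_{region}",
                bash_command=f"echo 'building {region} report for {{{{ ds }}}}'",
            )
            extract >> report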

Airflow is not intended to be a data processing tool itself. Instead, it should trigger and keep track of data processing jobs that run on external systems. Only small amounts of data should be passed between tasks using XComs.
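
As a hedged illustration of that advice (the task ids, bucket, and path below are invented), only a small piece of metadata, such as the location of an output file, should cross task boundaries via XCom, while the heavy processing stays in the external system:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def export_to_storage(ti):
        # Imagine kicking off an external processing job here; only the small
        # output location is pushed to XCom, not the data itself.
        ti.xcom_push(key="export_path", value="s3://my-bucket/exports/latest.parquet")


    def notify(ti):
        path = ti.xcom_pull(task_ids="export_to_storage", key="export_path")
        print(f"Export ready at {path}")


    with DAG(
        dag_id="example_xcom_handoff",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
    ) as dag:
        export = PythonOperator(task_id="export_to_storage", python_callable=export_to_storage)
        announce = PythonOperator(task_id="notify", python_callable=notify)
        export >> announce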

Important Core Concepts:

A workflow is defined as a DAG (Directed Acyclic Graph). A DAG is a collection of tasks along with the dependencies between them.

A Task is a unit of work within a DAG. There are three common types of tasks:

  • Operators are predefined task templates that can be used to build the DAGs. Operators perform a specific action. There are many built-in operators like PythonOperator, BashOperator, etc. It’s possible to write your own customized operator.
  • Sensors are a special type of operator that wait for an event to happen. The event can be time-based, the arrival of a file, or an external trigger. Airflow has a large set of pre-built sensors, such as HttpSensor and TimeDeltaSensor.
  • A TaskFlow-decorated @task is a custom Python function packaged as a task.

Under the hood, all of these are subclasses of Airflow’s BaseOperator. When you call an operator or sensor in a DAG file, you create a task, as the sketch below illustrates.
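
To make the distinction concrete, here is a hedged sketch that combines a sensor, a prebuilt operator, and a TaskFlow-decorated function in one DAG; the dag_id, file path, and function bodies are invented for illustration:

    from datetime import datetime

    from airflow import DAG
    from airflow.decorators import task
    from airflow.operators.bash import BashOperator
    from airflow.sensors.filesystem import FileSensor

    with DAG(
        dag_id="example_task_types",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Sensor: waits for an event (here, a file appearing) before moving on.
        wait_for_file = FileSensor(
            task_id="wait_for_file",
            filepath="/tmp/incoming/data.csv",  # illustrative path
            poke_interval=60,
            timeout=60 * 60,
        )

        # Operator: a predefined template that performs one specific action.
        archive = BashOperator(
            task_id="archive",
            bash_command="echo 'archiving {{ ds }}'",
        )

        # TaskFlow: a plain Python function packaged as a task via @task.
        @task
        def summarize():
            print("summarizing the new file")

        wait_for_file >> archive >> summarize()

Each of these calls defines a task; Airflow then creates a task instance for each of them every time the DAG runs.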

A visual representation of DAG, Tasks, and Operators.

A DagRun is an instantiation of a DAG, containing task instances for a specific execution date. A DAG run can be created by the scheduler or triggered externally.


A task instance is the combination of a DAG, a task, and a point in time (the execution date), and it represents a specific run of that task. Task instances also have an indicative state, such as running, success, failed, skipped, or up_for_retry.
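
For example, with an Airflow 2.x installation and the hypothetical example_reports DAG from earlier, a DagRun can be created and its task instances inspected from the command line, roughly like this:

    # Trigger a new DagRun outside the regular schedule.
    airflow dags trigger example_reports

    # List recent DagRuns for the DAG, with their states.
    airflow dags list-runs -d example_reports

    # Check the state of one task instance for a given logical date.
    airflow tasks state example_reports extract 2023-01-01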

For more detailed information about other core concepts, visit the Airflow core concepts documentation.

Airflow’s Architecture:

Airflow Architecture involving core components

The scheduler is responsible for triggering scheduled workflows and for determining which tasks run where and when.

The webserver is a Flask application served by Gunicorn that presents a user-friendly interface to trigger, monitor, and troubleshoot DAGs (i.e., workflows) and tasks.

The executor handles running tasks. It runs within the scheduler and pushes task execution out to workers. In the default Airflow installation, the executor runs everything inside the scheduler process itself.

The metadata database stores all of the state relating to DAGs and tasks. The scheduler, executor, and webserver all read from and write to it. PostgreSQL, MySQL, MSSQL, and SQLite are supported.

Workers are the processes that execute the tasks handed off by the executor. Depending on the executor type, an Airflow setup may or may not include separate workers.

The DAGs directory is where all DAG files are kept and where the scheduler looks for DAGs. It defaults to the airflow/dags folder and can be configured in the airflow.cfg file, as sketched below.
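
As a rough sketch (the paths and connection string below are placeholders), the relevant settings live in airflow.cfg and can also be overridden with AIRFLOW__SECTION__KEY environment variables such as AIRFLOW__CORE__DAGS_FOLDER:

    [core]
    # Where the scheduler looks for DAG files.
    dags_folder = /home/airflow/airflow/dags
    # Which executor runs tasks (SequentialExecutor is the default with SQLite).
    executor = LocalExecutor

    [database]
    # Metadata database connection (placeholder credentials). In Airflow 2.3+
    # this key lives under [database]; older versions keep it under [core].
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow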

Conclusion

I hope you find this post informative and helpful. In this article, we covered what Airflow is, when to use it and when not to, how it is built, and the core concepts you need to get started. Now that you have a good grasp of Apache Airflow’s fundamentals, you are ready to explore more of its capabilities. In the next article, we will discuss how to set up Airflow on your machine and run your first Airflow DAG.

If you have any questions or comments, please feel free to leave them below.
