Elastic Airflow: A utility that accelerates job processing by scaling workers in line with the workload

Introduction

Data analytics and automation are now core needs of business processes. Implementing real-time solutions is a serious challenge given the dynamic nature of the data being generated today. Proper orchestration, scheduling, management, and monitoring of data pipelines are crucial for any data platform to be stable and reliable.

Workload management has become so crucial that companies are finding different ways to create and schedule jobs internally and to automate their processes. Airflow has emerged as an attractive workflow management platform for implementing data analytics solutions. Dynamic pipeline generation, the use of Directed Acyclic Graphs (DAGs), parameterization of scripts, dynamic orchestration support through Celery, Service Level Agreements (SLAs), a strong user interface (UI) for monitoring metrics, support for both calendar and crontab scheduling, the ability to rerun a DAG instance after a failure, and scalability make it a good choice for workload management solutions. Data warehousing, growth analytics, experimentation, email targeting, sessionization, search, and data infrastructure maintenance are some of the processes fueled by Airflow.

The basic idea of the project is to devise a dynamic workload management mechanism for Airflow (Elastic Airflow) so that it can scale up and down based on the scheduled jobs, within applied limits. This paper discusses Airflow and how the concept of Elastic Airflow benefits business requirements.

Existing System and its Drawbacks

Airflow works in a distributed environment. The Airflow scheduler schedules jobs according to the dependencies defined in Directed Acyclic Graphs (DAGs). The scheduler periodically polls to see whether any registered DAGs need to be executed. If a specific DAG is due to be triggered, the scheduler creates a new DAG Run instance in the metastore and starts to trigger the individual tasks in the DAG. The scheduler then pushes messages into the queuing service. A message contains information about the task to execute (dag_id, task_id) and the function that needs to be performed. In some cases, the user interacts with the web server: a user can manually trigger a DAG. A DAG Run is then created, and the scheduler triggers the individual tasks in the same way as described above.
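
To make this concrete, the sketch below shows a minimal, illustrative DAG (the DAG id, tasks, and schedule are assumptions, not part of the implemented system). Once such a file is registered, the scheduler creates a DAG Run on each schedule interval and enqueues the task instances for workers to pick up; module paths follow Airflow 1.x.

    # Minimal, illustrative DAG: two tasks with a dependency (Airflow 1.x imports).
    # The scheduler registers this file, creates a DAG Run per schedule interval,
    # and enqueues each task instance (dag_id, task_id) for a worker to execute.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG(
        dag_id="example_elastic_airflow",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = BashOperator(task_id="load", bash_command="echo loading")

        extract >> load  # "load" runs only after "extract" succeeds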

Airflow uses Celery to execute several tasks concurrently on several worker servers using multiprocessing. Celery is an asynchronous task queue based on distributed message passing. Job information is stored in the metadata database, which is updated in a timely manner, and jobs can be monitored through the Airflow web UI and/or the logs. When the workload is high, Airflow starts to run out of resources, and to accommodate the spike we must add workers manually to scale Airflow up. When the workload is low, the status of the worker nodes must be checked and workers deleted depending on that status. These processes require manual intervention in the form of rigorous and extensive monitoring, which is time consuming and not cost effective for business processes.
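
The worker status check described above can be scripted instead of being done by hand. Below is a minimal sketch using Celery's inspection API to count the tasks running on each worker; the broker URL is a placeholder that would point at the RabbitMQ instance used by the Airflow cluster.

    # Sketch: ask the Celery control API which workers are busy.
    # The broker URL is a placeholder for the cluster's RabbitMQ endpoint.
    from celery import Celery

    app = Celery(broker="amqp://guest:guest@rabbitmq-host:5672//")

    def tasks_per_worker():
        """Return a mapping of worker name -> number of tasks it is running."""
        active = app.control.inspect().active() or {}   # {worker: [task, ...]}
        return {worker: len(tasks) for worker, tasks in active.items()}

    if __name__ == "__main__":
        for worker, count in tasks_per_worker().items():
            print(worker, "is running", count, "task(s)")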

Implemented Architecture

Figure 1: Architecture of Elastic Airflow

Technology Stack

Airflow is a platform to programmatically author, schedule, and monitor workflows. Workflows are authored as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes the tasks on an array of workers while following the specified dependencies. The UI makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

Celery Executor is one of the ways to scale out the number of workers. Using it requires setting up a Celery backend, changing the configuration to point the executor parameter to CeleryExecutor, and providing the related Celery settings.
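
As an illustration, a minimal airflow.cfg excerpt for such a setup could look like the following, with RabbitMQ as the Celery broker and PostgreSQL as the metadata database and result backend; hostnames and credentials are placeholders.

    # Illustrative airflow.cfg excerpt (Airflow 1.x); hosts and credentials are placeholders.
    [core]
    executor = CeleryExecutor
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres-host:5432/airflow

    [celery]
    broker_url = amqp://guest:guest@rabbitmq-host:5672//
    result_backend = db+postgresql://airflow:airflow@postgres-host:5432/airflow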

Compute Machine is a web service that provides secure and resizable compute capacity in the cloud; EC2 and Compute Engine are the respective examples in AWS and GCP.

RabbitMQ is the most widely deployed open source message broker. It is lightweight and easy to deploy on premises and in the cloud. It supports multiple messaging protocols.

Cloud Functions is a fully managed serverless execution platform for building and connecting cloud services. It helps in writing simple, single-purpose functions that are triggered when a watched event fires. Leveraging Cloud Functions eliminates the need to provision and manage servers, and it provides a connective layer of logic that lets you write code to connect and extend cloud services. Cloud Functions can be replaced by AWS Lambda when the solution is deployed on AWS.
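
In this architecture the Cloud Function itself can stay very small. The sketch below shows a hypothetical HTTP-triggered Python entry point; the handler name and the scale_workers helper are illustrative stand-ins, and the scaling decision itself is sketched after the numbered steps in the next section.

    # Sketch of an HTTP-triggered Cloud Function entry point in Python.
    # Both names below are hypothetical; scale_workers stands in for the
    # decision logic described under "Working of the Implemented System".
    def scale_workers():
        """Placeholder for the scaling decision logic (sketched in the next section)."""
        return "no_change"

    def elastic_airflow_handler(request):
        """Entry point invoked on a schedule (e.g. every 5 minutes)."""
        decision = scale_workers()
        return decision, 200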

PostgreSQL is a powerful, open source object-relational database system that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads.

Figure 2: Workflow

Working of the Implemented System

  1. A scheduler triggers the Cloud Function once every 5 minutes.
  2. As soon as the Cloud Function is triggered, it checks whether all the existing worker nodes are busy. If all the worker nodes are busy and the queue length is greater than zero, it adds an additional worker node to the cluster, on the condition that the number of worker nodes does not exceed the maximum number of worker nodes (a pre-defined variable).
  3. If the queue is empty, it checks whether any of the worker nodes is idle (not running any tasks), sends a stop-consumption signal to that worker node, and then deletes the worker node from the worker pool. It also sends the stop-consumption signal to the workers that are still running tasks, to ensure that no worker picks up new tasks from the queue before it gets deleted (a sketch of this decision logic follows the list).
  4. Whenever a new worker is added or deleted, alerts and notifications are sent to the owners.
  5. However, there is a default worker node that is never sent the stop-consumption signal and is never deleted; even if all the other workers are removed while the queue is empty, this default worker node picks up newly arrived tasks.
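
Putting the steps above together, the sketch below expresses the scaling rules as a pure decision function. The names, the worker limit, and the action strings are illustrative assumptions; in the deployed system the queue length would come from RabbitMQ, the per-worker task counts from Celery's inspection API, and the returned actions would be carried out against the compute and messaging APIs.

    # Hedged sketch of the scaling rules in steps 1-5; names and limits are illustrative.
    MAX_WORKERS = 10                      # pre-defined upper limit (step 2)
    DEFAULT_WORKER = "airflow-worker-0"   # never stopped or deleted (step 5)

    def plan_scaling(queue_len, tasks_per_worker):
        """Return a list of (action, worker) decisions.

        queue_len        -- number of messages waiting in the queue (RabbitMQ)
        tasks_per_worker -- dict of worker name -> tasks currently running
                            (e.g. gathered through Celery's inspection API)
        """
        actions = []

        # Step 2: every worker busy and tasks waiting -> add a worker, within limits.
        if queue_len > 0 and all(n > 0 for n in tasks_per_worker.values()):
            if len(tasks_per_worker) < MAX_WORKERS:
                actions.append(("add_worker", None))

        # Step 3: empty queue -> stop consumption everywhere except on the default
        # worker, and delete the workers that are already idle.
        elif queue_len == 0:
            for worker, running in tasks_per_worker.items():
                if worker == DEFAULT_WORKER:
                    continue
                actions.append(("stop_consuming", worker))
                if running == 0:
                    actions.append(("delete_worker", worker))

        # Step 4: the caller sends alerts/notifications for every action returned.
        return actions

    if __name__ == "__main__":
        # Example: all workers busy and a non-empty queue -> plan to add a worker.
        print(plan_scaling(queue_len=3, tasks_per_worker={
            "airflow-worker-0": 2, "airflow-worker-1": 1}))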

Advantages of Implemented System

Automation helps streamline business processes with less human intervention and eliminates the chances of human error. Airflow's progression through the Apache incubation process is a sign that the project follows the processes and principles laid out by the Apache Software Foundation. The implemented system is thus highly cost effective and reduces the need for manual intervention. Typically, the solution grows reactively in response to the increasing need to schedule individual jobs.

Parameterization, dynamic exchange of parameters, exception handling, and the elimination of hardcoding to the greatest extent possible make it a readily deployable solution. The solution can also be ported to other cloud providers and thus has a broader scope.

Conclusion

This paper presents a framework for Airflow applications and shows how automation is the near future. It devises a cost-effective solution that reduces unnecessary manual intervention and the rigorous monitoring otherwise required to scale worker nodes on demand. Automating the master node could add further to the effectiveness of the solution, and the solution can also be migrated to other cloud providers.

About the Author

Murali Sathenapalli, Cloud Engineer

Murali is a Cloud Engineer at Core Compete and enjoys architecting and deploying sophisticated data and analytics workloads on elastic, scalable cloud platforms.
