Running in Airflow

This tutorial shows you how to export a Ploomber pipeline to Airflow using Soopervisor.

Why Ploomber?

The main difference between Ploomber and Airflow is that Ploomber focuses on the development stage. You can write your tasks as Python functions, Jupyter notebooks, SQL scripts, or even R scripts, and Ploomber takes care of converting that collection of functions and scripts into a pipeline. Furthermore, it allows you to debug and test locally, with no need for extra infrastructure.

Airflow is a general-purpose tool aimed at covering a broader range of scenarios; it is an excellent tool for running workflows in production but provides little help during development.

Using Ploomber for development and Airflow for production gets you the best of both worlds: a great development experience and a production orchestrator.

Exporting a Ploomber pipeline using Soopervisor

Generating an Airflow pipeline from a Ploomber one is as simple as installing Soopervisor and running one command:

pip install soopervisor
soopervisor export-airflow --output airflow/

Once the export process finishes, you’ll see a new airflow/ folder with two subfolders: dag/, which contains the Airflow DAG definition, and ploomber/, which contains your project’s source code. To deploy, move those directories to your AIRFLOW_HOME.
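
For example, a minimal deployment could look like this (a sketch; the paths are illustrative and the exact destination depends on your Airflow configuration, such as your dags_folder setting):

# copy the generated DAG definition and the project's source code
# into your AIRFLOW_HOME (adjust paths to your setup)
export AIRFLOW_HOME=~/airflow
cp -r airflow/dag airflow/ploomber "$AIRFLOW_HOME"/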

If you don’t pass the --output parameter, it will export the project to AIRFLOW_HOME.

Generated Airflow DAG

The generated Airflow pipeline consists of BashOperator tasks, one per task in the original Ploomber pipeline. Each Airflow task executes a script that creates a conda environment and then runs the corresponding Ploomber task.

The generated file is simple, and you can customize it to your needs by using Airflow’s API directly.

NOTE: we are working on adding more features, such as supporting other types of operators for exported tasks. Let us know what features we are missing by opening an issue in the repository.

Requirements on the Airflow host

For Airflow to parse the DAG, the host must have soopervisor installed; to execute the pipeline, it must also have conda installed.
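
For example, you could prepare the host like this (a minimal sketch, assuming pip and conda are already set up on the host):

# run on the Airflow host: soopervisor is needed to parse the DAG
pip install soopervisor

# conda is needed to execute the pipeline's tasks
conda --version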

Examples

The sample projects repository contains a few example pipelines that can be exported to Airflow.

Before running the examples, make sure you have an Airflow installation available. Check out Airflow’s documentation for instructions.

git clone https://github.com/ploomber/projects
cd projects/

# set your airflow home (this value might be different from yours)
export AIRFLOW_HOME=~/airflow

# export a few projects
cd ml-intermediate
soopervisor export-airflow

cd ../etl
soopervisor export-airflow

You should see the exported ml-intermediate and etl projects as DAGs.
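
To verify, you can list the registered DAGs; the exact command depends on your Airflow version:

# Airflow 2.x
airflow dags list

# Airflow 1.10.x
airflow list_dags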

Storing pipeline artifacts

It’s common to store the artifacts generated by your pipeline (data files, trained models, etc.) for later consumption. We currently support uploading to Google Cloud Storage; click here to see an example.

If all tasks execute on the same machine, this is optional. If you’re using a distributed executor (e.g., Celery), storing your pipeline artifacts guarantees that downstream tasks have access to their inputs (which are the outputs of upstream tasks).

Optional: Parametrizing your pipeline

Say you are developing a pipeline: you might choose a folder to save all outputs (such as /data/project/output). When you deploy to Airflow, it is unlikely that you’ll have the same filesystem layout; hence, you would choose a different folder (say, /airflow-data/project/output). How do you deal with these two configurations?

Ploomber provides a clean way of achieving this through parametrization: the basic idea is that you can parametrize where your pipeline saves its output.

To achieve this, Soopervisor looks for an env.airflow.yaml file and uses it to load your Airflow pipeline. This way, you can keep development and production configurations separate.
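
For instance, if your pipeline.yaml saves its products under a {{root}} placeholder (the key name here is just an example; use whichever placeholders your pipeline.yaml defines), an env.airflow.yaml could point it to the production location:

# in your project's folder, write an illustrative env.airflow.yaml
cat > env.airflow.yaml <<'EOF'
root: /airflow-data/project/output
EOF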

One important thing to keep in mind: when using Airflow, you should not store pipeline products inside the project’s root folder, because this can negatively impact the performance of Airflow, which continuously scans folders looking for new pipeline definitions. To prevent this, Soopervisor analyzes your pipeline during the export process and shows an error message if any task would attempt to save files inside the project’s root folder.