Running in Airflow¶
This tutorial shows you how to export a Ploomber pipeline using Soopervisor.
The main difference between Ploomber and Airflow is that Ploomber focuses on the development stage: you can write your tasks as Python functions, Jupyter notebooks, SQL scripts, or even R scripts, and Ploomber takes care of converting the collection of functions/scripts into a pipeline. Furthermore, it allows you to debug and test locally, with no need for extra infrastructure.
Airflow is a general-purpose tool aimed at covering more scenarios; it is an excellent tool for running workflows in production but provides little help during development.
Using Ploomber for development and Airflow for production gets you the best of both worlds: a great development experience and a production-grade orchestrator.
Exporting a Ploomber pipeline using Soopervisor¶
Generating an Airflow pipeline from a Ploomber one is as simple as installing Soopervisor and running one command:
pip install soopervisor
soopervisor export-airflow --output airflow/
Once the export process finishes, you’ll see a new airflow/ folder with dag/, which contains the Airflow DAG definition, and ploomber/, which contains your project’s source code. To deploy, move those directories to your AIRFLOW_HOME.
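The resulting layout looks roughly like this (the exact files inside each folder depend on your project):

```
airflow/
├── dag/        # Airflow DAG definition
└── ploomber/   # your project's source code
```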
If you don’t pass the --output parameter, the project is exported to a default location.
Generated Airflow DAG¶
The generated Airflow pipeline consists of BashOperator tasks, one per task in the original Ploomber pipeline. Each one runs a script that creates a conda virtual environment and executes the corresponding task.
The generated file is simple, and you can customize it to your needs by using Airflow’s API directly.
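For reference, a generated DAG is conceptually similar to the following sketch. The task names, paths, and script invocation below are illustrative assumptions, not what Soopervisor emits verbatim, and the BashOperator import path varies across Airflow versions:

```python
# Illustrative sketch only; the actual generated file may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="ml-intermediate",  # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # One BashOperator per Ploomber task; each one calls a script that
    # creates a conda environment and runs the task in it
    get = BashOperator(
        task_id="get",
        bash_command="bash /path/to/ploomber/run.sh get",  # hypothetical script
    )
    features = BashOperator(
        task_id="features",
        bash_command="bash /path/to/ploomber/run.sh features",
    )

    # Preserve the dependency structure of the original Ploomber pipeline
    get >> features
```

Because the generated file is a plain Airflow DAG, you can edit it directly: change operators, add retries, or set a schedule using Airflow’s API.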
NOTE: we are working on adding more features, such as support for other types of operators for exported tasks. Let us know which features you are missing by opening an issue in the repository.
Requirements in the Airflow host¶
For Airflow to parse the DAG, it must have soopervisor installed. To execute the pipeline, it must have ploomber installed.
The sample projects repository contains a few example pipelines that can be exported to Airflow:
Before running the examples, make sure you have an Airflow installation available. Check out Airflow’s documentation for instructions.
git clone https://github.com/ploomber/projects
cd projects/

# set your airflow home (this value might differ from yours)
export AIRFLOW_HOME=~/airflow

# export a few projects
cd ml-intermediate
soopervisor export-airflow

cd ../etl
soopervisor export-airflow
You should see the exported ml-intermediate and etl projects as DAGs.
Storing pipeline artifacts¶
It’s common to store the artifacts generated by your pipeline (data files, trained models, etc.) for later consumption. We currently support uploading to Google Cloud Storage; see the documentation for an example.
If all tasks execute on the same machine, this is optional. If you’re using a distributed executor (e.g., Celery), storing your pipeline artifacts guarantees that downstream tasks have access to their inputs (the outputs of upstream tasks).
Optional: Parametrizing your pipeline¶
Say you are developing a pipeline and choose a folder to save all outputs (such as /data/project/output). When you deploy to Airflow, it is unlikely that you have the same filesystem; hence, you would choose a different location (such as /airflow-data/project/output). How do you deal with these two configurations?
Ploomber provides a clean way of achieving this through parametrization: you can parametrize where your pipeline saves its output.
To achieve this, Soopervisor looks for an env.airflow.yaml file, which is used to load your Airflow pipeline. This way, you can keep development and production configurations separated.
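As an illustration, suppose your pipeline parametrizes the output location with a placeholder. The key name and paths below are hypothetical; the pattern shown follows Ploomber’s {{placeholder}} syntax:

```yaml
# env.yaml -- used during local development (hypothetical path)
root: /data/project/output

# env.airflow.yaml -- used when exporting to Airflow (hypothetical path)
root: /airflow-data/project/output

# pipeline.yaml -- tasks reference the placeholder instead of a hard-coded path
tasks:
  - source: clean.py
    product: "{{root}}/clean.parquet"
```

With this setup, the same pipeline definition runs unmodified in both environments; only the environment file changes.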
One important thing to keep in mind: when using Airflow, you should not store pipeline products inside the project’s root folder, because Airflow continuously scans folders looking for new pipeline definitions, and extra files there can negatively impact its performance. To prevent this from happening, Soopervisor analyzes your pipeline during the export process and shows you an error message if any pipeline task attempts to save files inside the project’s root folder.