Running in Kubernetes

Soopervisor can export Ploomber projects to run in Kubernetes via Argo.

Argo

Argo is a general-purpose framework to execute, schedule, and monitor workflows in Kubernetes. Argo workflows are written in YAML and require you to specify each task in your pipeline, its dependencies, the script to run, the Docker image to use, mounted volumes, etc. This implies a steep (and often unnecessary) learning curve for many people who could benefit from it.

Soopervisor can export Ploomber projects to Argo’s YAML spec format. This way, users can develop their workflows locally and think in terms of functions, scripts, and notebooks, not clusters or containers. When they’re ready, they can scale the workload in a cluster.

A Ploomber workflow can be specified via a YAML spec (although there is also a Python API for advanced use cases), which only requires users to declare what to run (a function, script, or notebook) and where to save the output:

# Ploomber's "pipeline.yaml" example

tasks:
# tasks.get, tasks.features, and tasks.join are Python functions defined in tasks.py
- source: tasks.get
  product: output/get.parquet

- source: tasks.features
  product: output/features.parquet

- source: tasks.join
  product: output/join.parquet

# fit.ipynb is a notebook
- source: fit.ipynb
  product:
    # where to save the executed copy
    nb: output/nb.ipynb
    # and any other generated files
    model: output/model.pickle

Execution order is inferred by building a directed acyclic graph through static analysis of the source code.

Click here to see the full code example.

Overview

Once you have a Ploomber project ready, generate the Argo spec using:

soopervisor export

This command runs a few checks to make sure your pipeline is good to go, and then generates Argo’s YAML spec. Before running your workflow, you have to make sure the source code is available to the Pods. Soopervisor implements a simple approach: it uses kubectl to upload the code directly to a shared disk in the cluster. However, you can customize this process to suit your needs.

To upload the code using the CLI:

soopervisor export --upload

Once you’ve uploaded your project, you can execute the workflow with:

argo submit -n argo argo.yaml

By standardizing the deployment process (how to upload the code, which image to use, how to mount volumes and how to install dependencies), end-users are able to leverage Argo without having to modify their Ploomber projects.

For options to configure the exported DAG and to set up code uploading, see the Argo API documentation.

Technical details

This section describes in detail how Ploomber projects interface with Argo/Kubernetes, along with a complete example.

argo submit triggers a workflow by uploading a YAML file, but it does not take care of uploading anything else, such as the project’s source code. This implies that to execute a Ploomber project you have to ensure that 1) the project’s source code is available on each Pod and 2) Pods can get their input data (which is generated by previous tasks).

Soopervisor implements a simple deployment workflow but you can customize it to suit your needs. Implementation details and possible customizations are described in the following sections.

Generated Argo spec

soopervisor export analyzes your pipeline and automatically generates the Argo YAML spec. This involves generating one entry in the spec per pipeline task and preserving the same graph structure by declaring each task’s dependencies.

Each Pod runs a single task using the continuumio/miniconda3 image by default. The script executed on each Pod sets up the conda environment using the user-provided environment.yml file, then executes the given task.
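For reference, a minimal environment.yml could look like the one below (the environment name and package list are illustrative assumptions; declare whatever your tasks actually need):

# environment.yml (illustrative; list the packages your tasks need)
name: ml-basic
dependencies:
  - python=3.8
  - pip
  - pip:
    - ploomber
    - pandas
    - scikit-learn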

By default, the spec mounts a persistent volume claim (PVC) named nfs: the folder /exports/{project-name} from that PVC is mounted to /mnt/nfs on each Pod, where {project-name} is replaced by your project’s name (the name of the folder that contains your pipeline.yaml file). Tasks are executed with /mnt/nfs as the working directory.
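To make the above concrete, here is a heavily simplified, illustrative sketch of what the generated argo.yaml could look like for the example pipeline shown earlier. The project name (ml-basic), the setup commands, and the dependency structure below are assumptions for illustration only; the spec soopervisor actually produces contains more detail:

# illustrative sketch of a generated argo.yaml (abridged; not the exact output)
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-basic-
spec:
  entrypoint: dag
  volumes:
    # persistent volume claim named "nfs", shared by all Pods
    - name: nfs
      persistentVolumeClaim:
        claimName: nfs
  templates:
    # the DAG mirrors the dependencies inferred from the pipeline
    - name: dag
      dag:
        tasks:
          - name: get
            template: get
          - name: features
            template: features
            dependencies: [get]
          - name: join
            template: join
            dependencies: [get, features]
          - name: fit
            template: fit
            dependencies: [join]
    # one template per task; "features", "join", and "fit" look analogous
    - name: get
      script:
        image: continuumio/miniconda3
        command: [bash]
        workingDir: /mnt/nfs
        volumeMounts:
          - name: nfs
            mountPath: /mnt/nfs
            subPath: ml-basic
        source: |
          # illustrative commands: create the environment, then run the task
          conda env create --file environment.yml --name env
          source activate env
          ploomber task get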

The mounting logic can be customized using a soopervisor.yaml configuration file; see the Argo API for details.
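For illustration, a customized soopervisor.yaml could look roughly like the sketch below, pointing the spec at a different claim and sub-path. The key names here are an assumption, not the authoritative schema; check the Argo API reference for the exact fields:

# hypothetical soopervisor.yaml sketch; see the Argo API reference for the real schema
mounted_volumes:
  - name: shared-disk
    sub_path: '{{project_name}}'
    spec:
      persistentVolumeClaim:
        claimName: shared-disk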

Uploading project’s source code

A Ploomber project is composed of a conda environment.yml, a pipeline.yaml, and source code files (.py, .sql, .R, etc.). The simplest way to make the source code available to every Pod is to upload your code to a persistent volume and mount it on every Pod when it starts execution.

soopervisor export --upload

To enable the use of the --upload flag, you have to configure the code_pod section in the soopervisor.yaml configuration file; see the Argo API for details.
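For instance, the Google Cloud example later in this document uses the following code_pod configuration, where args matches the label selector of the Pod exposing the shared disk and path is the destination directory inside it:

# soopervisor.yaml (same code_pod configuration used in the Google Cloud example below)
code_pod:
  # label selector used to locate the Pod that exposes the shared disk
  args: -l role=nfs-server
  # destination path for the project's source code inside that Pod
  path: /exports/{{project_name}}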

The primary disadvantage of uploading the code directly is that there is no control over pipeline versions. A different approach is to generate a package from your project (each time with a different version number), upload it to a package registry, and have the Pods pull the project from the registry. Another approach is to fetch the source code from a repository.

Input data

During pipeline execution, tasks get their inputs from previous tasks (also known as upstream dependencies). When running a pipeline on a single machine, this works fine because all files are saved to the same filesystem. When running in Kubernetes, each Pod has its own filesystem.

The simplest solution is to mount a shared disk and have all tasks write their outputs to the shared resource. This reduces the need to move large datasets over the network.

Although simple, this approach is infeasible if the cluster spans several cloud regions and it isn’t possible to mount a shared disk on all Pods. An alternative approach is to have each task fetch its inputs over the network before execution.

The current implementation assumes all tasks write to a shared disk; the mounting logic can be configured using a soopervisor.yaml file.

Full example

Option 1: minikube

Install kubectl and minikube.

Part 1: create a Kubernetes cluster and install Argo

# by default it creates a 20GB disk, which is too much for this example
minikube start --disk-size 10GB

# install argo
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/quick-start-postgres.yaml

Submit a sample workflow to make sure Argo is working:

argo submit -n argo --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml

Part 2: Add a shared folder

# create a folder to share data with the cluster
mkdir $HOME/minikube

# mount shared folder
minikube mount $HOME/minikube:/host

Part 3: Execute Ploomber sample projects

Open a new terminal to enable Argo’s UI:

# port forwarding to enable the UI
kubectl -n argo port-forward svc/argo-server 2746:2746

Then open: http://127.0.0.1:2746

Open a new terminal; let’s now run a Ploomber sample pipeline, which consists of a few tasks that prepare data and train a machine learning model:

# get the sample projects
git clone https://github.com/ploomber/projects

# copy source code to the shared folder
# (recommended: ml-basic/ (machine learning pipeline) and etl/)
cp -r projects/ml-basic $HOME/minikube

# generate argo spec
cd projects/ml-basic

# uncomment the "config for minikube" section in soopervisor.yaml

soopervisor export

# submit workflow
argo submit -n argo --watch argo.yaml

You can also watch progress from the UI.

Once execution is finished, you can take a look at the generated artifacts:

ls $HOME/minikube/output/

To delete the cluster:

minikube delete

Option 2: Google Cloud

This section is a complete example of running a Ploomber project in Kubernetes using Google Cloud. It assumes gcloud and kubectl are already configured.

Part 1: create a Kubernetes cluster and install Argo

# create cluster
gcloud container clusters create my-cluster --num-nodes=1 --zone us-east1-b

# install argo
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/quick-start-postgres.yaml

Submit a sample workflow to make sure Argo is working:

argo submit -n argo --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml

Part 2: Add a shared disk (NFS)

# create disk. make sure the zone matches your cluster
gcloud compute disks create --size=10GB --zone=us-east1-b gce-nfs-disk

# configure the nfs server
curl -O https://raw.githubusercontent.com/ploomber/soopervisor/master/doc/assets/01-nfs-server.yaml
kubectl apply -f 01-nfs-server.yaml

# create service
curl -O https://raw.githubusercontent.com/ploomber/soopervisor/master/doc/assets/02-nfs-service.yaml
kubectl apply -f 02-nfs-service.yaml

# check service
kubectl get svc nfs-server

# create persistent volume claim
curl -O https://raw.githubusercontent.com/ploomber/soopervisor/master/doc/assets/03-nfs-pv-pvc.yaml
kubectl apply -f 03-nfs-pv-pvc.yaml

# run sample workflow (uses nfs and creates an empty file on it)
curl -O https://raw.githubusercontent.com/ploomber/soopervisor/master/doc/assets/dag.yaml
argo submit -n argo --watch dag.yaml

Containers see the contents of the shared disk’s /exports/ directory at /mnt/nfs.

Check the output of dag.yaml:

# get nfs-server pod name
kubectl get pod

# replace with the name of the pod
kubectl exec --stdin --tty {nfs-server-pod-name} -- /bin/bash

Once inside the Pod, run:

ls /exports/

You should see files A, B, C, and D, generated by dag.yaml.

Part 3: Execute Ploomber sample projects

Enable Argo’s UI:

# port forwarding to enable the UI
kubectl -n argo port-forward svc/argo-server 2746:2746

Then open: http://127.0.0.1:2746

Run a Ploomber sample pipeline, which consists of a few tasks that prepare data and train a machine learning model:

# get the sample projects
git clone https://github.com/ploomber/projects

# get nfs pod name
kubectl get pods -l role=nfs-server

# upload source code to the nfs server
# (recommended: ml-basic/ (machine learning pipeline) and etl/)
kubectl cp projects/ml-basic {nfs-server-pod-name}:/exports/ml-basic

# generate argo spec
soopervisor export

# submit workflow
argo submit -n argo --watch argo.yaml

Alternatively, you can use the --upload flag.

Save the following soopervisor.yaml file:

code_pod:
  args: -l role=nfs-server
  path: /exports/{{project_name}}

To execute the workflow:

# generate argo spec and upload source code
soopervisor export --upload

# submit workflow
argo submit -n argo argo.yaml

You can keep track of execution by opening the UI.

Once execution is finished, you can take a look at the generated artifacts:

# get pod names
kubectl get pod

# ssh to nfs pod, replace {pod-name} with your nfs pod name
kubectl exec --stdin --tty {pod-name} -- /bin/bash

# output folder
cd /exports/ml-basic/output/

Make sure you delete your cluster after running this example.

Other examples to try

You can execute other examples from the same repository in the same way:

1. ml-intermediate - A slightly more sophisticated ML example, showing how to run integration tests upon task execution and how to parametrize your pipeline (e.g., run locally with a data sample to iterate faster, but with the full dataset in Kubernetes).

2. ml-advanced - Shows how to write a machine learning pipeline using the Python API (instead of a pipeline.yaml file) and how to create an array of experiments to try several models.

3. etl - A pipeline with SQL tasks demonstrating how to extract data from a database and then process it with Python and R.

A note on mounted volumes

Soopervisor offers a way to configure mounted volumes through an optional soopervisor.yaml file; here we explain the default behavior.

Our cluster has a shared disk that exposes its /exports/ directory. By default, soopervisor expects a volume claim with name nfs and mounts the folder /exports/{project-name} from the shared disk to /mnt/nfs in the Pods, where {project-name} is the name of the directory that contains your project. At runtime, the Pod’s working directory is set to /mnt/nfs.
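For reference, the claim the default configuration expects is a regular Kubernetes PVC named nfs, along the lines of the sketch below (the storage size and access mode are illustrative assumptions; in the Google Cloud example above, 03-nfs-pv-pvc.yaml creates the actual claim):

# illustrative PersistentVolumeClaim matching the default name soopervisor expects
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi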