Kubernetes (Argo)

This tutorial shows how to run a pipeline in Kubernetes via Argo Workflows, either locally using minikube or in Google Cloud.

If you encounter any issues with this tutorial, let us know.

Click here to see the Argo Community Meeting talk.

Example 1: minikube

This first tutorial runs a pipeline in a local Kubernetes cluster using minikube.

Pre-requisites

Instructions

We first create a local Kubernetes cluster and install Argo:

minikube start --disk-size 10GB

# install argo
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/quick-start-postgres.yaml

Submit a sample workflow to make sure Argo is working:

argo submit -n argo --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml

Since tasks need to load artifacts generated by upstream tasks, we create a shared directory to store everything:

# create a folder
mkdir $HOME/minikube

# mount shared folder to kubernetes
minikube mount $HOME/minikube:/host

Tip

Enable Argo’s UI. Run this in a new terminal:

# port forwarding to enable the UI
kubectl -n argo port-forward svc/argo-server 2746:2746

Then, open: http://127.0.0.1:2746

Let’s now run a Ploomber sample Machine Learning pipeline:

# get the sample projects
git clone https://github.com/ploomber/projects
cd projects/ml-online/

# configure development environment
ploomber install

# activate environment
conda activate ml-online

# configure docker environment to use minikube
eval $(minikube docker-env) # Unix
minikube docker-env | Invoke-Expression # Windows PowerShell

# add a new target platform
soopervisor add training --backend argo-workflows

The last command will create a soopervisor.yaml file. We need to make a few modifications. Paste the following:

# configuration for the target platform
training:
  backend: argo-workflows
  # we are not uploading the docker image, set it as null
  repository: null
  # mount the /host folder (which is linked to $HOME/minikube), it will
  # be visible to pods in /mnt/shared-folder
  mounted_volumes:
    - name: shared-folder
      spec:
        hostPath:
          path: /host

Now, we must configure the project to store all outputs in the shared folder. Create an env.yaml file with the following content; make sure you create it in the root directory (the same folder that contains the setup.py file):

sample: False
product_root: /mnt/shared-folder

Let’s now submit the workflow:

# build docker image (takes a few minutes the first time) and generate yaml spec
soopervisor export training

# submit workflow
argo submit -n argo --watch training/argo.yaml

You may also watch the progress from the UI.
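If you prefer the command line over the UI, you can also check the workflow's status with the argo CLI (an optional convenience; the submit command above already watches progress in the terminal):

# list workflows and their current status
argo list -n argo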

Once the execution finishes, take a look at the generated artifacts:

ls $HOME/minikube/

To delete the cluster:

minikube delete

Congratulations! You just ran Ploomber on Kubernetes!

Example 2: Google Cloud

This second tutorial runs a pipeline in a Kubernetes cluster on Google Cloud.

Note

You may use an existing Google Cloud project or create a new one to follow this tutorial.

Pre-requisites

Instructions

Create a cluster and install Argo:

# create cluster
gcloud container clusters create my-cluster --num-nodes=1 --zone us-east1-b

# install argo
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/quick-start-postgres.yaml

# create storage bucket (choose whatever name you want)
gsutil mb gs://YOUR-BUCKET-NAME

Submit a sample workflow to make sure Argo is working:

argo submit -n argo --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml

Tip

Enable Argo’s UI:

# port forwarding to enable the UI
kubectl -n argo port-forward svc/argo-server 2746:2746

Then, open: http://127.0.0.1:2746

Let’s now run a Ploomber sample Machine Learning pipeline:

# get the sample projects
git clone https://github.com/ploomber/projects
cd projects/ml-online/

# configure development environment
ploomber install

# activate environment
conda activate ml-online

# add a new target platform
soopervisor add training --backend argo-workflows

The previous command creates a soopervisor.yaml file where we can configure the container registry to upload our Docker image:

training:
  backend: argo-workflows
  repository: gcr.io/PROJECT-ID/my-ploomber-pipeline

Replace PROJECT-ID with your actual project ID.
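If you are unsure of your project ID, you can retrieve it with gcloud (an optional convenience; any other way of looking it up works too):

# print the currently configured project ID
gcloud config get-value project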

Each task runs in isolation, so we must ensure that products generated by a given task are available to its downstream tasks. We can use Google Cloud Storage for that; add the following to the src/ml_online/pipeline.yaml file:

# more content above...

serializer: ml_online.io.serialize
unserializer: ml_online.io.unserialize

# add these two lines
clients:
  File: ml_online.clients.get_gcloud

# content continues...

The previous change tells Ploomber to call the get_gcloud function defined in the src/ml_online/clients.py module to get the client. Edit clients.py to add your bucket name:
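The relevant part of clients.py looks roughly like this (a minimal sketch based on Ploomber's GCloudStorageClient; the exact arguments in the sample project may differ, and YOUR-BUCKET-NAME is a placeholder):

from ploomber.clients import GCloudStorageClient


def get_gcloud():
    # client that Ploomber uses to upload/download products;
    # replace YOUR-BUCKET-NAME with the bucket you created earlier
    return GCloudStorageClient(bucket_name='YOUR-BUCKET-NAME',
                               parent='ml-online',
                               json_credentials_path='credentials.json')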

You can ignore the rest of the file. Finally, we add service account credentials to upload to Google Cloud Storage. To learn more about service accounts, click here.

Store the service account key in a credentials.json file in the root project directory (the same folder as setup.py):
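If you don't have a key yet, you can create a service account and download its key with gcloud along these lines (a sketch: the account name ploomber-tutorial is arbitrary, and roles/storage.objectAdmin is one role that grants enough access to the bucket; adjust to your organization's policies):

# create a service account (name is arbitrary)
gcloud iam service-accounts create ploomber-tutorial

# grant it access to Cloud Storage objects
gcloud projects add-iam-policy-binding PROJECT-ID \
    --member="serviceAccount:ploomber-tutorial@PROJECT-ID.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

# download the key to credentials.json (run this from the project root)
gcloud iam service-accounts keys create credentials.json \
    --iam-account=ploomber-tutorial@PROJECT-ID.iam.gserviceaccount.com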

We are ready to execute the workflow:

# authenticate to push docker image
gcloud auth configure-docker

# packages code, create docker image and upload it (takes a few mins)
soopervisor export training

# submit workflow
argo submit -n argo training/argo.yaml

You may keep track of the execution by opening the UI. Once it finishes, check the bucket to see the generated output.

Congratulations! You just ran Ploomber on Kubernetes!

Attention

Make sure you delete your cluster, bucket, and image after running this example!

# delete cluster
gcloud container clusters delete my-cluster --zone us-east1-b

# delete bucket
gsutil rm -r gs://YOUR-BUCKET-NAME

# delete image (you can get the image id from the google cloud console)
gcloud container images delete IMAGE-ID

Optional: Mounting a shared disk

Note

If you use a shared disk instead of storing artifacts in S3 or Google Cloud Storage, you must execute the pipeline with the --skip-tests flag (e.g., soopervisor export training --skip-tests); otherwise, the command will fail if your project does not have a remote storage client configured.

In the example, we configured the pipeline.yaml file to use Google Cloud Storage to store artifacts. This serves two purposes: 1) make artifacts available to us upon execution, and 2) make artifacts available to downstream tasks.

This happens because pods run in isolation: if task B depends on task A, it fetches A’s output from cloud storage before executing. We can save download time (and cut costs) by mounting a shared volume so that B doesn’t have to download A’s output. Ploomber automatically detects this change and only calls the cloud storage API for uploading.

Here’s how to configure a shared disk:

# create disk. make sure the zone matches your cluster
gcloud compute disks create --size=10GB --zone=us-east1-b gce-nfs-disk

# configure the nfs server
curl -O https://raw.githubusercontent.com/ploomber/soopervisor/master/doc/assets/01-nfs-server.yaml
kubectl apply -f 01-nfs-server.yaml

# create service
curl -O https://raw.githubusercontent.com/ploomber/soopervisor/master/doc/assets/02-nfs-service.yaml
kubectl apply -f 02-nfs-service.yaml

# check service
kubectl get svc nfs-server

# create persistent volume claim
curl -O https://raw.githubusercontent.com/ploomber/soopervisor/master/doc/assets/03-nfs-pv-pvc.yaml
kubectl apply -f 03-nfs-pv-pvc.yaml

Optionally, you can check that the disk is properly configured by running this sample workflow:

# run sample workflow (uses nfs and creates an empty file on it)
curl -O https://raw.githubusercontent.com/ploomber/soopervisor/master/doc/assets/dag.yaml
argo submit -n argo --watch dag.yaml

Check the output:

# get nfs-server pod name
kubectl get pod

# replace with the name of the pod
kubectl exec --stdin --tty {nfs-server-pod-name} -- /bin/bash

Once inside the Pod, run:

ls /exports/

You should see files A, B, C, and D, generated by the previous workflow.

Let’s now run the Machine Learning workflow. Since we configured a shared disk, artifacts from upstream tasks are available to downstream ones (no need to download them from cloud storage anymore); the storage client is only used to upload artifacts for us to review later.

To make the shared disk available to the pods that run each task, we have to modify soopervisor.yaml:

training:
  backend: argo-workflows
  repository: gcr.io/your-project/your-repository
  mounted_volumes:
    - name: nfs
      sub_path: my-shared-folder
      spec:
        persistentVolumeClaim:
          claimName: nfs

This mounts the my-shared-folder subdirectory of our shared disk at /mnt/nfs/ on each pod. Now, we must configure the pipeline to store all products in /mnt/nfs/. Create an env.yaml file in the root folder (the same folder that contains the setup.py file) with this content:

sample: False
# this configures the pipeline to store all outputs in the shared disk
product_root: /mnt/nfs
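From here, exporting and submitting the workflow works the same as in the earlier steps (repeated here for convenience; add --skip-tests as described in the note above only if your project has no remote storage client configured):

# build and upload docker image, generate yaml spec
soopervisor export training

# submit workflow
argo submit -n argo --watch training/argo.yaml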