Task communication ================== Since ``soopervisor`` executes tasks in isolated environments, you must provide a way to pass the output files of each task to upcoming tasks that use them as inputs. There are two ways of doing so: either mount a shared disk on all containers, or configure a ``File`` client to use remote storage; we describe both options in the next sections. Shared disk ----------- .. tab:: K8s/Argo If using K8s/Argo, you can mount a volume on each pod by adding some configuration to your ``soopervisor.yaml`` file. Refer to the :doc:`../api/kubernetes` configuration schema documentation for details. Note that the configuration flexibility is limited; if you need a more flexible approach, you can generate the Argo YAML Spec (by running the ``soopervisor export`` command), and then edit the generated spec to suit your needs (the spec is generated in ``{name}/argo.yaml``, where ``{name}`` is the name of your target environment). .. tab:: Airflow When using Airflow, the ``soopervisor add`` generates an output ``.py`` file with the Airflow DAG, you can edit this file to configure a shared disk and execute ``soopervisor export`` afterwards. The :doc:`../tutorials/airflow` tutorial shows how to do this, you can see the ``.py`` file that the tutorial uses `here `_. .. tab:: AWS Batch To execute pipelines in AWS Batch, you must create a `compute environment `_, map it to a job queue, and include the job queue name in your ``soopervisor.yaml`` file. You can configure a shared disk using Amazon EFS, `click here `_ to learn how to configure EFS in your compute environment. .. tab:: SLURM If running on SLURM, sharing a disk depends on your cluster configuration, so ensure you can mount a disk in all nodes and that your pipeline writes their outputs in the shared disk. .. important:: If using a shared disk, execute ``soopervisor export`` with the ``--skip-tests`` flag, otherwise Soopervisor will raise an error if your pipeline does not have a ``File`` client configured. Using remote storage -------------------- As an alternative, you can configure a ``File`` client to ensure each task has their input files before execution. We currently support Amazon S3 and Google Cloud Storage. To configure a client, add the following to your ``pipeline.yaml`` file: .. code-block:: yaml :caption: pipeline.yaml # configure a client clients: # note the capital F File: clients.get tasks: # content continues... Then, create a ``clients.py`` file (in the same directory as your ``pipeline.yaml``) and declare a ``get`` function that returns a ``File`` client instance: .. tab:: Amazon S3 .. code-block:: python :caption: clients.py from ploomber.clients import S3Client def get(): return S3Client(bucket_name='YOUR-BUCKET-NAME', parent='PARENT-FOLDER-IN-BUCKET', json_credentials_path='credentials.json') `Click here to see the S3Client documentation. `_ .. tab:: Google Cloud Storage .. code-block:: python :caption: clients.py from ploomber.clients import GCloudStorageClient def get(): return GCloudStorageClient(bucket_name='YOUR-BUCKET-NAME', parent='PARENT-FOLDER-IN-BUCKET', json_credentials_path='credentials.json') `Click here to see the GCloudStorageClient documentation. `_ Next, create a ``credentials.json`` (in the same directory as your ``pipeline.yaml``) with your authentication information. The file should look like this: .. tab:: Amazon S3 .. code-block:: json { "aws_access_key_id": "YOUR-ACCESS-KEY-ID", "aws_secret_access_key": "YOU-SECRET-ACCESS-KEY" } .. tab:: Google Cloud Storage .. code-block:: json { "type": "service_account", "project_id": "project-id", "private_key_id": "private-key-id", "private_key": "private-key", "client_email": "client-email", "client_id": "client-id", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/service-account.iam.gserviceaccount.com" } **Note:** If you're using a Docker-based exporter (K8s/Argo, Airflow, or AWS Batch),you must ensure that your ``credentials.json`` file is included in your Docker image. You can ensure this by adding the following to your ``soopervisor.yaml`` .. code-block:: yaml :caption: soopervisor.yaml some-name: # tell soopervisor to include the credentials.json file include: [credentials.json] # continues You can check your local configuration by loading your pipeline using ``ploomber status``. If you see a table listing your tasks, it means the client has been configured successfully. Furthermore, when executing the ``soopervisor export`` command and using a Docker-based exporter (K8s/Argo, Airflow, and AWS Batch), Soopervisor will check that the ``File`` client in the Docker image is correctly configured by trying to establish a connection with your credentials to the remote storage.