
[feature] Allow jobs to be scheduled on AWS Fargate

See original GitHub issue

Feature Area

/area backend

What feature would you like to see?

I am trying to run a large number of kubeflow pipeline jobs on AWS Fargate.

The Kubeflow Pipelines components are deployed on AWS EKS. While the EKS cluster has a Fargate profile that allows pods to be scheduled onto Fargate's virtual nodes, Kubeflow pipeline jobs contain privileged containers, which Fargate does not support (https://docs.aws.amazon.com/eks/latest/userguide/fargate.html).

What is the use case or pain point?

This feature would enable more cost-efficient job scheduling. Many jobs (e.g., hyperparameter tuning, scenario analysis …) are ephemeral, so scheduling them on a serverless machine pool such as the one Fargate provides makes more sense: it avoids reserving a pool of nodes upfront while still supporting bursty workloads.

However, Kubeflow pipeline jobs use privileged containers, which Fargate does not support. For example, the wait container

  containers:
    - name: wait
      image: 'gcr.io/ml-pipeline/argoexec:v2.7.5-license-compliance'
      command:
        - argoexec
        - wait
      env:
        - name: ARGO_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: ARGO_CONTAINER_RUNTIME_EXECUTOR
          value: pns
      resources: {}
      volumeMounts:
        - name: podmetadata
          mountPath: /argo/podmetadata
        - name: mlpipeline-minio-artifact
          readOnly: true
          mountPath: /argo/secret/mlpipeline-minio-artifact
        - name: input-artifacts
          mountPath: /mainctrfs/tmp/inputs/config/data
          subPath: config
        - name: input-artifacts
          mountPath: /mainctrfs/tmp/inputs/data/data
          subPath: convect-prepare-data-out_path
        - name: pipeline-runner-token-j2fm7
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        capabilities:
          add:
            - SYS_PTRACE

requires additional capabilities under securityContext (here, SYS_PTRACE for the pns executor), which Fargate disallows.

I am wondering if there are any workarounds or better solutions to make the jobs schedulable on serverless resource pools such as Fargate.

Is there a workaround currently?

I do not see any solutions so far.


Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

Issue Analytics

  • State:open
  • Created 2 years ago
  • Reactions:7
  • Comments:8 (4 by maintainers)

Top GitHub Comments

3 reactions
yuhuishi-convect commented, May 4, 2021

I found a workaround under version 1.2 to allow scheduling jobs onto Fargate nodes. Here are the steps I took:

  1. Switch the Argo workflow executor to k8sapi:
kubectl edit cm workflow-controller-configmap -n kubeflow

and change containerRuntimeExecutor from pns to k8sapi.
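
For reference, after the edit the relevant data key would look like this (a sketch; the other keys of workflow-controller-configmap are omitted, and depending on the Kubeflow version the setting may instead live inside a nested multi-line config entry):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  containerRuntimeExecutor: k8sapi
```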

  2. Modify the components to use an emptyDir volume as the output location. For example, I have the following helper function:
def mount_empty_dir(task: kfp.dsl.ContainerOp) -> kfp.dsl.ContainerOp:
    from kubernetes import client as k8s_client

    # Back the output location with an emptyDir volume, which Fargate supports.
    task = task.add_volume(
        k8s_client.V1Volume(
            empty_dir={},
            name="output-empty-dir",
        )
    )

    # Mount it where the component writes its outputs.
    task.container.add_volume_mount(
        k8s_client.V1VolumeMount(
            mount_path="/tmp/outputs",
            name="output-empty-dir",
        )
    )

    return task

Then apply the transformation to every op in the pipeline:

pipeline_conf.add_op_transformer(mount_empty_dir)
  3. Hint that an op can be scheduled on Fargate (this is specific to your Fargate profile). In my case, I am using the rule:

Any pod that has the label fargate-schedulable=true in the kubeflow namespace can be placed on Fargate.

So in the pipeline

task.add_pod_label("fargate-schedulable", "true")

will hint that the task can be scheduled on Fargate.
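
The label rule above can be sketched as a small predicate. Note that matches_fargate_profile and the exact selector (namespace kubeflow, label fargate-schedulable=true) are illustrative assumptions mirroring the profile described here, not part of kfp or EKS:

```python
def matches_fargate_profile(namespace: str, labels: dict) -> bool:
    """Return True if a pod with this namespace and these labels would be
    matched by the assumed Fargate profile selector."""
    # Mirrors the rule: pods labeled fargate-schedulable=true in the
    # kubeflow namespace are placed on Fargate.
    return namespace == "kubeflow" and labels.get("fargate-schedulable") == "true"

# A task that called task.add_pod_label("fargate-schedulable", "true")
# ends up with this label on its pod:
print(matches_fargate_profile("kubeflow", {"fargate-schedulable": "true"}))  # True
print(matches_fargate_profile("kubeflow", {}))  # False
```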

2 reactions
Bobgy commented, Apr 30, 2021

@yuhuishi-convect we might want to switch to the Argo v3 emissary executor (https://argoproj.github.io/argo-workflows/workflow-executors/#emissary-emissary), which doesn't require privileged permissions.
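
Under Argo v3, the same ConfigMap switch described in the workaround above would point at the emissary executor instead (a sketch, under the same assumptions about the workflow-controller-configmap layout):

```yaml
data:
  containerRuntimeExecutor: emissary
```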


Top Results From Across the Web

  • AWS Fargate Now Supports Time and Event-Based Task Scheduling — AWS Fargate now supports the ability to run tasks on a regular, scheduled basis and in response to CloudWatch Events.
  • Scheduling Amazon ECS tasks (AWS Documentation) — Amazon ECS provides a service scheduler for long-running tasks and applications. It also provides the ability to run tasks manually for batch jobs...
  • Scheduled tasks, Amazon ECS (AWS Documentation) — Amazon ECS supports creating scheduled tasks. Scheduled tasks use Amazon EventBridge rules to run tasks either on a schedule or in response...
  • Run event-driven and scheduled workloads at scale with AWS — The pull request initiates a Lambda function. The Lambda function invokes a Fargate task that takes care of the code scan. Lambda is...
  • Creating a single-node job definition on AWS Fargate resources — Open the AWS Batch console at AWS Batch console first-run wizard. From the top navigation bar, choose the Region to use. In...
