Unexpected stop of container build [CML, AWS, github]

My scenario: I’m trying to create an appropriate pipeline for my ML project. I’m using the following CML YAML file:

on:
  # Trigger the workflow on push or pull request
  push:
    branches:
      - mybranch

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on EC2
        env:
          PERSONAL_ACCESS_TOKEN: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-1
        run: |
          cml-runner \
              --repo https://github.com/MyCompany/myrepo \
              --token=$PERSONAL_ACCESS_TOKEN \
              --cloud aws \
              --cloud-region us-west-1 \
              --cloud-type=g3.4xlarge \
              --labels=cml-runner \
              --idle-timeout 30
    
  model-training:
    needs: [deploy-runner]
    runs-on: [self-hosted, cml-runner]
    container:
      image: docker://dvcorg/cml:0-dvc1-base1-gpu
      options: --gpus all
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.6'
      - name: Train model
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.DVC_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.DVC_SECRET_KEY }}
          REQUIREMENTS_FILE: 'training/training_req.txt'
        run: |
          export AWS_DEFAULT_REGION=us-east-1
          echo "Install reqs"
          sudo apt update
          sudo apt-get install default-jre scala
          pip install py4j
          pip install --no-cache-dir -e .
          export PYSPARK_PYTHON=python3
          echo "Start CML"
          python3 -m spacy download en_core_web_sm
          echo "Pull data"
          dvc repro
          echo "## Model metrics" > report.md
          cat prepare_data/metrics.txt >> report.md
          cml-send-comment report.md
          

As you can see, I used the image docker://dvcorg/cml:0-dvc1-base1-gpu, but I started receiving the following error message:

[Two screenshots of the error message, captured 2021-07-27 at 20:22:31 and 20:24:36; images not preserved]

I can see that the container started to build but stopped unexpectedly, and I cannot see the reason for this behavior. I did not change anything in my script; it ran successfully before and then simply stopped working.

Thanks!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

2 reactions
DavidGOrtega commented, Jul 28, 2021

Fixed! It seems that the error that appeared later on was a hiccup. I have tried multiple times successfully.

1 reaction
sergeychuvakin commented, Jul 28, 2021

Thanks! I was able to run it as well.
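
Since the failure disappeared on retry, it may help to distinguish a transient image-pull or container-start problem from a problem in the workflow itself. One possible approach, sketched here, is a temporary diagnostic job that runs directly on the self-hosted runner (without the `container:` key) and pulls and starts the same image by hand. The job name `check-image` is hypothetical. The sketch assumes that Docker and the NVIDIA container runtime are installed on the EC2 instance and that `nvidia-smi` is available inside the image, none of which the thread confirms. It is a fragment to merge under the existing `jobs:` key, not a complete workflow.

```yaml
jobs:
  # Hypothetical diagnostic job; not part of the original workflow.
  check-image:
    needs: [deploy-runner]
    runs-on: [self-hosted, cml-runner]
    steps:
      - name: Pull the training image manually
        # A failure here points at the registry or the network, not the workflow.
        run: docker pull dvcorg/cml:0-dvc1-base1-gpu
      - name: Start the image with GPU access
        # Assumes nvidia-smi is on the PATH inside the container.
        run: docker run --rm --gpus all dvcorg/cml:0-dvc1-base1-gpu nvidia-smi
```

If both steps pass while the `model-training` job still fails during container initialization, the cause is more likely to be in the job's `container:` settings than in the image or the instance.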
