All nodes in pool in state `starttaskfailed`: "Docker root dir $rootdir not within $USER_MOUNTPOINT"
Problem Description
I have been using Azure Batch Shipyard with VMs of type STANDARD_NC6 successfully for a while. Usually, I create a pool, submit some jobs (with several tasks) and kill the pool again, all over the course of at most a couple of days.
As of today, when creating the pool and submitting a job, all nodes enter the “starttaskfailed” state. I have deleted and recreated the pool and job several times. Using the Azure Batch Explorer, I checked the node startup logs and found the following text at the bottom of stdout.txt:
Client: Docker Engine - Community
Version: 19.03.0
API version: 1.39 (downgraded from 1.40)
Go version: go1.12.5
Git commit: aeac949
Built: Wed Jul 17 18:16:07 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.2
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 03:42:13 2019
OS/Arch: linux/amd64
Experimental: false
2019-07-23T11:11:50,787210380+00:00 - ERROR - Docker root dir Dir: not within /mnt
This seems to originate from shipyard_nodeprep.sh, lines 730-737:
local rootdir
rootdir=$(docker info | grep "Docker Root Dir" | cut -d' ' -f 4)
if echo "$rootdir" | grep "$USER_MOUNTPOINT" > /dev/null; then
    log DEBUG "Docker root dir: $rootdir"
else
    log ERROR "Docker root dir $rootdir not within $USER_MOUNTPOINT"
    exit 1
fi
It looks like the cut command does not properly extract the “Docker Root Dir” from the output of docker info (note that $rootdir = "Dir:"!).
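One plausible explanation, sketched below, is that the newer Docker client indents the "Docker Root Dir" line with a leading space, which shifts every space-delimited field by one. The two sample lines and the /mnt/docker path are assumptions for illustration, not captured output from the failing node; the sed-based extraction is only one possible whitespace-tolerant alternative, not necessarily the fix adopted upstream.

```shell
#!/usr/bin/env bash
# Hypothetical sample lines (assumed, not taken from the failing node):
# the old layout has no indentation, the new one has a leading space.
old_line='Docker Root Dir: /mnt/docker'
new_line=' Docker Root Dir: /mnt/docker'

# Same extraction as in shipyard_nodeprep.sh: 4th space-delimited field.
old_field=$(echo "$old_line" | cut -d' ' -f 4)
new_field=$(echo "$new_line" | cut -d' ' -f 4)

echo "old layout, field 4: $old_field"   # the path
echo "new layout, field 4: $new_field"   # the label fragment "Dir:"

# A whitespace-tolerant alternative: strip the label together with any
# leading spaces, and print whatever follows it.
robust=$(echo "$new_line" | sed -n 's/^ *Docker Root Dir: *//p')
echo "robust extraction:   $robust"
```

With a leading space, field 1 is empty, so field 4 becomes "Dir:", which matches the value in the error message. On a real node, `docker info --format '{{.DockerRootDir}}'` should avoid parsing the human-readable output altogether.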
Batch Shipyard Version
3.7.0
Steps to Reproduce
Create pool, then create job.
Expected Results
The pool gets created and the job + tasks start properly.
Actual Results
All nodes in the pool get stuck in “starttaskfailed”
Redacted Configuration
pool:
pool_specification:
  id: my-pool
  vm_configuration:
    platform_image:
      offer: UbuntuServer
      publisher: Canonical
      sku: 16.04-LTS
  vm_count:
    dedicated: 0
    low_priority: 10
  vm_size: STANDARD_NC6
Additional Logs
Header part from stdout.txt:
Configuration:
--------------
Custom image: 0
Native mode: 0
OS Distribution: ubuntu 16.04
Batch Shipyard version: 3.7.0
Blobxfer version: 1.7.0
Singularity version:
User mountpoint: /mnt
Mount path: /mnt/batch/tasks/mounts
Batch Insights: 0
Prometheus: NE=, CA=,
Network optimization: 1
Encryption cert thumbprint:
Install Kata Containers: 0
Default container runtime: runc
Install BeeGFS BeeOND: 0
Storage cluster mount:
Custom mount:
Install LIS:
GPU: False:nvidia-driver_cc37.run
Azure Blob: 1
Azure File: 0
GlusterFS on compute: 0
HPN-SSH: 0
Enable Azure Batch group for Docker access:
Fallback registry:
Docker image preload delay: 0
Cascade via container: 1
P2P: 0
Block on images: REDACTED#
Additional Comments
Top GitHub Comments
Thank you @alfpark! The workaround is working, but we had to recreate our pool from a clean state using a recompiled version of shipyard.
@alfpark Thanks a ton for providing such a quick workaround! Using `native: true` has indeed resolved this issue for me.