Compute nodes suddenly failing to start
Environment:
- aws-parallelcluster-2.4.1
- OS: alinux
- Scheduler: Slurm
- Master instance type: t2.medium
- Compute instance type: t2.2xlarge
Bug description and how to reproduce: Compute nodes are suddenly failing to spawn. Initial cluster creation worked fine, but new nodes keep dying because of a failure during initialization.
Additional context:
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: config-scripts-per-once already ran (freq=once)
Sep 27 14:27:23 cloud-init[3245]: stages.py[DEBUG]: Running module scripts-per-boot (<module 'cloudinit.config.cc_scripts_per_boot' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_boot.pyc'>) with frequency always
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: Running config-scripts-per-boot using lock (<cloudinit.helpers.DummyLock object at 0x7f020280d810>)
Sep 27 14:27:23 cloud-init[3245]: stages.py[DEBUG]: Running module scripts-per-instance (<module 'cloudinit.config.cc_scripts_per_instance' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_instance.pyc'>) with frequency once-per-instance
Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_per_instance - wb: [644] 20 bytes
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: Running config-scripts-per-instance using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_per_instance'>)
Sep 27 14:27:23 cloud-init[3245]: stages.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) with frequency once-per-instance
Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_user - wb: [644] 20 bytes
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: Running config-scripts-user using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_user'>)
Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-002'] with allowed return codes [0] (shell=True, capture=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 645, in runparts
    subp(prefix + [exe_path], capture=False, shell=True)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1626, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-002']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=True, capture=False)
Sep 27 14:34:38 cloud-init[3245]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Sep 27 14:34:38 cloud-init[3245]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules
    cc.run(run_name, mod.handle, func_args, freq=freq)
  File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run
    results = functor(*args)
  File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.py", line 38, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 652, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 2 attempted commands
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_ssh_authkey_fingerprints - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-ssh-authkey-fingerprints using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_ssh_authkey_fingerprints'>)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 512 bytes from /etc/ssh/sshd_config
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /home/ec2-user/.ssh/authorized_keys (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 404 bytes from /home/ec2-user/.ssh/authorized_keys
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module keys-to-console (<module 'cloudinit.config.cc_keys_to_console' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_keys_to_console.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_keys_to_console - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-keys-to-console using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_keys_to_console'>)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Running command ['/usr/libexec/cloud-init/write-ssh-key-fingerprints', '', 'ssh-dss'] with allowed return codes [0] (shell=False, capture=True)
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module phone-home (<module 'cloudinit.config.cc_phone_home' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_phone_home.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_phone_home - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-phone-home using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_phone_home'>)
Sep 27 14:34:38 cloud-init[3245]: cc_phone_home.py[DEBUG]: Skipping module named phone-home, no 'phone_home' configuration found
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module final-message (<module 'cloudinit.config.cc_final_message' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_final_message.pyc'>) with frequency always
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-final-message using lock (<cloudinit.helpers.DummyLock object at 0x7f02024694d0>)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 15 bytes from /proc/uptime
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Cloud-init v. 0.7.6 finished at Fri, 27 Sep 2019 14:34:38 +0000. Datasource DataSourceEc2. Up 460.71 seconds
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instance/boot-finished - wb: [644] 52 bytes
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module power-state-change (<module 'cloudinit.config.cc_power_state_change' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_power_state_change.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_power_state_change - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-power-state-change using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_power_state_change'>)
Sep 27 14:34:38 cloud-init[3245]: cc_power_state_change.py[DEBUG]: no power_state provided. doing nothing
Sep 27 14:34:38 cloud-init[3245]: cloud-init[DEBUG]: Ran 9 modules with 1 failures
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json'
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 15 bytes from /proc/uptime
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: cloud-init mode 'modules' took 435.260 seconds (435.05)
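For anyone triaging a similar log, the relevant signal is buried in DEBUG noise: part-002 starts at 14:27:23 and exits with code 1 at 14:34:38. Below is a minimal Python sketch that pulls out failed scripts and how long each ran. It is not part of ParallelCluster or cloud-init; the function name `find_failed_scripts` and the regexes are hypothetical, written only against the line format shown in this issue, and the `year` default is an assumption because these timestamps omit the year.

```python
import re
from datetime import datetime

# Matches lines shaped like the ones in this issue, e.g.:
#   Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Running command [...]
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) cloud-init\[\d+\]: "
    r"[\w.\-]+\[\w+\]: (?P<msg>.*)$"
)
START_RE = re.compile(r"^Running command \['(?P<path>[^']+)'")
FAIL_RE = re.compile(r"^Failed running (?P<path>\S+) \[(?P<code>-?\d+)\]")


def find_failed_scripts(lines, year=2019):
    """Return (path, exit_code, seconds_running) for each failed script.

    `year` is an assumption: the log timestamps carry no year.
    seconds_running is None when no matching start line was seen.
    """
    started = {}  # script path -> time its "Running command" line was logged
    failed = {}   # script path -> result tuple (first failure line wins)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # traceback lines and other unprefixed output
        ts = datetime.strptime(
            f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
        )
        msg = match.group("msg")
        start = START_RE.match(msg)
        if start:
            started[start.group("path")] = ts
            continue
        fail = FAIL_RE.match(msg)
        if fail and fail.group("path") not in failed:
            path = fail.group("path")
            began = started.get(path)
            elapsed = (ts - began).total_seconds() if began else None
            failed[path] = (path, int(fail.group("code")), elapsed)
    return list(failed.values())
```

Passing the lines of the log file to `find_failed_scripts` gives one tuple per failed script. The failure is logged twice in this log (once at WARNING, once at DEBUG), so the sketch keeps only the first occurrence per path. Since Stdout and Stderr are empty in the traceback, the script's own output would have to be found elsewhere on the node.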
Issue Analytics
- State:
- Created 4 years ago
- Comments:21 (9 by maintainers)
Yes, I tested without a custom AMI, using Amazon Linux. I spun up a machine, added the necessary tools, and then generated an image. The image isn't public at the moment. It seems to be an issue with the Intel MPI repo as well: the imported key wasn't being verified properly.
On Mon, Sep 30, 2019 at 4:12 AM Enrico Usai notifications@github.com wrote:
@medcelerate @jflournoy There’s an issue with the GPG key required for Intel MPI, we’re currently working on a fix.