Compute nodes suddenly failing to start

Environment:

  • aws-parallelcluster-2.4.1
  • OS: alinux
  • Scheduler: Slurm
  • Master instance type: t2.medium
  • Compute instance type: t2.2xlarge

Bug description and how to reproduce: Compute nodes are suddenly failing to spawn. Initial cluster creation worked fine, but new nodes keep dying because of a failure during initialization.

Additional context:

Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: config-scripts-per-once already ran (freq=once)
Sep 27 14:27:23 cloud-init[3245]: stages.py[DEBUG]: Running module scripts-per-boot (<module 'cloudinit.config.cc_scripts_per_boot' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_boot.pyc'>) with frequency always
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: Running config-scripts-per-boot using lock (<cloudinit.helpers.DummyLock object at 0x7f020280d810>)
Sep 27 14:27:23 cloud-init[3245]: stages.py[DEBUG]: Running module scripts-per-instance (<module 'cloudinit.config.cc_scripts_per_instance' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_per_instance.pyc'>) with frequency once-per-instance
Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_per_instance - wb: [644] 20 bytes
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: Running config-scripts-per-instance using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_per_instance'>)
Sep 27 14:27:23 cloud-init[3245]: stages.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) with frequency once-per-instance
Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_user - wb: [644] 20 bytes
Sep 27 14:27:23 cloud-init[3245]: helpers.py[DEBUG]: Running config-scripts-user using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_scripts_user'>)
Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-002'] with allowed return codes [0] (shell=True, capture=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/part-002 [1]
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 645, in runparts
    subp(prefix + [exe_path], capture=False, shell=True)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 1626, in subp
    cmd=args)
ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/part-002']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=True, capture=False)
Sep 27 14:34:38 cloud-init[3245]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Sep 27 14:34:38 cloud-init[3245]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/stages.py", line 660, in _run_modules
    cc.run(run_name, mod.handle, func_args, freq=freq)
  File "/usr/lib/python2.7/dist-packages/cloudinit/cloud.py", line 63, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python2.7/dist-packages/cloudinit/helpers.py", line 197, in run
    results = functor(*args)
  File "/usr/lib/python2.7/dist-packages/cloudinit/config/cc_scripts_user.py", line 38, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 652, in runparts
    % (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 2 attempted commands
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_ssh_authkey_fingerprints - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-ssh-authkey-fingerprints using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_ssh_authkey_fingerprints'>)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 512 bytes from /etc/ssh/sshd_config
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /home/ec2-user/.ssh/authorized_keys (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 404 bytes from /home/ec2-user/.ssh/authorized_keys
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module keys-to-console (<module 'cloudinit.config.cc_keys_to_console' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_keys_to_console.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_keys_to_console - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-keys-to-console using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_keys_to_console'>)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Running command ['/usr/libexec/cloud-init/write-ssh-key-fingerprints', '', 'ssh-dss'] with allowed return codes [0] (shell=False, capture=True)
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module phone-home (<module 'cloudinit.config.cc_phone_home' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_phone_home.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_phone_home - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-phone-home using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_phone_home'>)
Sep 27 14:34:38 cloud-init[3245]: cc_phone_home.py[DEBUG]: Skipping module named phone-home, no 'phone_home' configuration found
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module final-message (<module 'cloudinit.config.cc_final_message' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_final_message.pyc'>) with frequency always
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-final-message using lock (<cloudinit.helpers.DummyLock object at 0x7f02024694d0>)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 15 bytes from /proc/uptime
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Cloud-init v. 0.7.6 finished at Fri, 27 Sep 2019 14:34:38 +0000. Datasource DataSourceEc2.  Up 460.71 seconds
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instance/boot-finished - wb: [644] 52 bytes
Sep 27 14:34:38 cloud-init[3245]: stages.py[DEBUG]: Running module power-state-change (<module 'cloudinit.config.cc_power_state_change' from '/usr/lib/python2.7/dist-packages/cloudinit/config/cc_power_state_change.pyc'>) with frequency once-per-instance
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_power_state_change - wb: [644] 20 bytes
Sep 27 14:34:38 cloud-init[3245]: helpers.py[DEBUG]: Running config-power-state-change using lock (<FileLock using file '/var/lib/cloud/instances/i-08f7d7aabbf4d517c/sem/config_power_state_change'>)
Sep 27 14:34:38 cloud-init[3245]: cc_power_state_change.py[DEBUG]: no power_state provided. doing nothing
Sep 27 14:34:38 cloud-init[3245]: cloud-init[DEBUG]: Ran 9 modules with 1 failures
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json'
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Read 15 bytes from /proc/uptime
Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: cloud-init mode 'modules' took 435.260 seconds (435.05)
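
The log above shows that a single user-data script (part-002) exited with code 1 after running for several minutes, which is what makes the scripts-user module fail. As a rough aid for spotting this in a long log, here is a minimal Python sketch that pairs each "Running command" line with its matching "Failed running" line and reports the elapsed time. The function name `find_failed_scripts` and the `year` parameter are hypothetical and not part of cloud-init or ParallelCluster. The regular expressions assume the exact line layout shown in this log.

```python
import re
from datetime import datetime

# Assumed layout, taken from the log in this issue:
# "Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) cloud-init\[\d+\]: "
    r"(?P<src>[\w.\-]+)\[(?P<level>\w+)\]: (?P<msg>.*)$"
)
START_RE = re.compile(r"^Running command \['(?P<path>[^']+)'")
FAIL_RE = re.compile(r"^Failed running (?P<path>\S+) \[(?P<code>\d+)\]")


def find_failed_scripts(log_text, year=2019):
    """Return (script_path, exit_code, seconds_elapsed) for each failed script.

    The syslog timestamps carry no year, so `year` is supplied by the caller
    (2019 here, per the "finished at Fri, 27 Sep 2019" line in the log).
    """
    started = {}   # script path -> time its "Running command" line was logged
    seen = set()   # failures are logged at both WARNING and DEBUG; report once
    failures = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # traceback lines have no timestamp prefix
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        msg = m.group("msg")
        start = START_RE.match(msg)
        if start:
            started[start.group("path")] = ts
            continue
        fail = FAIL_RE.match(msg)
        if fail and fail.group("path") not in seen:
            path = fail.group("path")
            seen.add(path)
            began = started.get(path)
            elapsed = (ts - began).total_seconds() if began else None
            failures.append((path, int(fail.group("code")), elapsed))
    return failures


if __name__ == "__main__":
    sample = "\n".join([
        "Sep 27 14:27:23 cloud-init[3245]: util.py[DEBUG]: Running command "
        "['/var/lib/cloud/instance/scripts/part-002'] with allowed return "
        "codes [0] (shell=True, capture=False)",
        "Sep 27 14:34:38 cloud-init[3245]: util.py[WARNING]: Failed running "
        "/var/lib/cloud/instance/scripts/part-002 [1]",
        "Sep 27 14:34:38 cloud-init[3245]: util.py[DEBUG]: Failed running "
        "/var/lib/cloud/instance/scripts/part-002 [1]",
    ])
    for path, code, elapsed in find_failed_scripts(sample):
        print(f"{path} exited with {code} after {elapsed} s")
```

On the three sample lines, which are copied from the log, the sketch reports part-002 failing with exit code 1 after 435 seconds. That matches the roughly 435 seconds cloud-init reports for its 'modules' stage, so nearly all of the boot delay was spent inside that one script. The log does not say why the script failed, because its stdout and stderr were not captured.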

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 21 (9 by maintainers)

Top GitHub Comments

2 reactions
medcelerate commented, Sep 30, 2019

Yes, I tested without a custom AMI, using Amazon Linux. For the custom AMI, I spun up a machine, added the necessary tools, and then generated an image. The image isn’t public at the moment. It also seems to be an issue with the Intel MPI repo: the imported key wasn’t verifying properly.

On Mon, Sep 30, 2019 at 4:12 AM Enrico Usai notifications@github.com wrote:

Hi @medcelerate, thank you for your analysis and your logs.

I see you are using a custom_ami. The problem could be related to your custom AMI.

  1. Did you test without the custom_ami parameter?
  2. Which process (https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html) did you follow to build your custom AMI?
  3. What is the source AMI ID of your AMI (if it is public)?

Thank you

1 reaction
sean-smith commented, Sep 30, 2019

@medcelerate @jflournoy There’s an issue with the GPG key required for Intel MPI; we’re currently working on a fix.
