Additional setup when enabling gpu (avoid the crash loop)

I’ve spent an entire day trying to debug why enabling the GPU add-on on MicroK8s 1.21 on Ubuntu 20.04 resulted in a crash loop.

I tried different versions of the GPU operator by modifying the enable.gpu.sh script, and I looked through a lot of logs.

In the end, this issue had the solution: https://github.com/NVIDIA/gpu-operator/issues/173#issuecomment-804836922

For completeness’ sake:

# Update /etc/systemd/system/multi-user.target.wants/snap.microk8s.daemon-containerd.service
- Restart=on-failure
- Type=simple
+ Restart=always
+ Type=notify
+ Delegate=yes
+ KillMode=process

sudo systemctl daemon-reload
sudo systemctl restart snap.microk8s.daemon-containerd
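
The manual edit above could also be scripted. The following is a minimal sketch and not part of the original issue: the function name `patch_containerd_unit` is hypothetical, and it assumes the unit file still contains the exact `Restart=on-failure` and `Type=simple` lines shown in the diff.

```shell
# Hypothetical helper: apply the unit-file changes from the diff above.
# Usage: patch_containerd_unit /path/to/unit.service
patch_containerd_unit() {
    unit_file="$1"
    tmp_file="$(mktemp)" || return 1

    # Rewrite the two existing lines and add the two new ones after Type=.
    # Lines that do not match are copied unchanged, so a second run is a no-op.
    awk '
        /^Restart=on-failure$/ { print "Restart=always"; next }
        /^Type=simple$/ {
            print "Type=notify"
            print "Delegate=yes"
            print "KillMode=process"
            next
        }
        { print }
    ' "$unit_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }

    # Write back with cat rather than mv, so that a symlinked unit file
    # keeps pointing at its original target.
    cat "$tmp_file" > "$unit_file" || { rm -f "$tmp_file"; return 1; }
    rm -f "$tmp_file"
}
```

Run as root, this would be called with the path from the diff, followed by the `daemon-reload` and `restart` commands already shown.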

As I expect that more people may run into this particular problem, it would be nice to document this in the troubleshooting section.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
treo commented, Jun 12, 2021

Could not unload NVIDIA driver kernel modules, driver is in use

This only works properly on systems that don’t already have a UI running and don’t already have the drivers installed. If the drivers are installed, you need to remove that installation and blacklist the nouveau driver. You absolutely want the card to not be in use already.
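
Blacklisting nouveau is commonly done with a modprobe configuration file. A minimal sketch, assuming an Ubuntu host; the function name `write_nouveau_blacklist` and the file name `blacklist-nouveau.conf` are illustrative choices, not taken from the comment:

```shell
# Hypothetical helper: write a modprobe config that blacklists nouveau.
# Usage: write_nouveau_blacklist /etc/modprobe.d/blacklist-nouveau.conf
write_nouveau_blacklist() {
    conf_file="$1"
    # "blacklist" stops automatic loading by alias; "modeset=0" disables
    # kernel mode setting in case the module is loaded anyway.
    printf '%s\n' \
        'blacklist nouveau' \
        'options nouveau modeset=0' > "$conf_file"
}
```

On Ubuntu, the usual follow-up is to run `sudo update-initramfs -u` and reboot so the change takes effect.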

If you are already in that situation and the container just can’t unload the drivers, try listing the NVIDIA-related modules with lsmod | grep nvidia and unloading them one by one with rmmod (first the two modules that depend on the nvidia module, then the nvidia module itself). If it tells you that a module is still in use, make sure X or gdm isn’t running.
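
The unload order can be derived from the `Used by` count that `lsmod` prints: a module whose count is 0 has nothing depending on it and can be removed first. A rough sketch with hypothetical function names; it must run as root and gives up as soon as something still holds a module:

```shell
# Read lsmod output on stdin and print the nvidia modules whose use count
# (third column) is 0, i.e. the ones that can be unloaded right now.
unloadable_nvidia_modules() {
    awk '$1 ~ /^nvidia/ && $3 == 0 { print $1 }'
}

# Repeatedly unload whatever is currently unused until no nvidia module is
# left. Returns non-zero if a module stays in use (for example by X or gdm).
unload_nvidia_modules() {
    while lsmod | grep -q '^nvidia'; do
        candidates="$(lsmod | unloadable_nvidia_modules)"
        if [ -z "$candidates" ]; then
            echo "nvidia modules still in use; stop X or gdm and retry" >&2
            return 1
        fi
        for module in $candidates; do
            rmmod "$module" || return 1
        done
    done
}
```

Looping on the use count avoids hard-coding module names, which can differ between driver versions.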

If you need to have a GUI running at the same time or want to use the host system’s drivers, you should probably use microk8s version 1.20, as using the host drivers is the default there.

Or, if you feel adventurous, you can copy the enable.gpu.sh script from your microk8s installation and modify the helm parameters for the gpu-operator (see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#chart-customization-options). The operator does support using host drivers, but that way you don’t get automatic driver installation when you add additional nodes, so I didn’t dive into that too much.

You’ll need to define a few environment variables for the script to run directly (i.e. when it is not invoked via microk8s enable ...), but figuring them out is quite straightforward.

0 reactions
ktsakalozos commented, Jun 18, 2021

The above PR should address this issue. As soon as it gets merged, a new snap should land on the latest/edge channel.
