Additional setup when enabling gpu (avoid the crash loop)
I’ve spent an entire day trying to debug why enabling the GPU on microk8s 1.21 on Ubuntu 20.04 resulted in a crash loop.
I tried different versions of the gpu operator by modifying the enable.gpu.sh script, and I looked at a lot of logs.
In the end this issue had the solution: https://github.com/NVIDIA/gpu-operator/issues/173#issuecomment-804836922
For completeness’ sake:
# Update /etc/systemd/system/multi-user.target.wants/snap.microk8s.daemon-containerd.service
- Restart=on-failure
- Type=simple
+ Restart=always
+ Type=notify
+ Delegate=yes
+ KillMode=process
sudo systemctl daemon-reload
sudo systemctl restart snap.microk8s.daemon-containerd
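To confirm the edit took effect before restarting, the unit file can be checked for the four settings listed above. This is a minimal sketch, not part of microk8s: check_containerd_unit is a hypothetical helper name, and it assumes each setting sits on its own line exactly as written in the diff.

```shell
#!/bin/sh
# Sketch: check that a systemd unit file carries the settings from the diff above.
# check_containerd_unit is a hypothetical helper name, not something microk8s ships.
check_containerd_unit() {
    unit_file="$1"
    status=0
    for setting in Restart=always Type=notify Delegate=yes KillMode=process; do
        # -x matches the whole line, -F treats the setting as a literal string
        if grep -qxF "$setting" "$unit_file"; then
            echo "ok:      $setting"
        else
            echo "missing: $setting"
            status=1
        fi
    done
    # Non-zero exit status when at least one setting is absent
    return "$status"
}
```

It would be called with the path of the unit file, for example check_containerd_unit /etc/systemd/system/multi-user.target.wants/snap.microk8s.daemon-containerd.service, and it only reads the file.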
As I expect that more people may run into this particular problem, it would be nice to document this in the troubleshooting section.
Issue Analytics
- Created 2 years ago
- Reactions:1
- Comments:8 (3 by maintainers)
Top GitHub Comments
This only works properly on systems where no UI is already running and the drivers are not already installed. If they are installed, you need to remove the installation and blacklist the nouveau driver. You absolutely want the card to not be in use already.
If you are already in that situation and the container just can’t unload the drivers, try listing the nvidia-related modules with
lsmod | grep nvidia
and unload them one by one with rmmod (first the two modules that depend on the nvidia module, then the nvidia module itself). If it tells you that a module is still in use, make sure X or gdm isn’t running.
If you need to have a GUI running at the same time, or want to use the host system’s drivers, you should probably use microk8s version 1.20, as that is the default in that version.
Or, if you feel adventurous, you can copy the enable.gpu.sh script from your microk8s installation and modify the helm parameters for the gpu-operator (see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#chart-customization-options). It does support using host drivers, but that way you don’t get automatic driver installation when you add additional nodes, so I didn’t dive into that too much.
You’ll need to define a few environment variables for the script to run directly (i.e. when it is not invoked via microk8s enable ...), but figuring them out is quite straightforward.
The above PR should address this issue. As soon as it gets merged, a new snap should land on the latest/edge channel.
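For the advice above about unloading the nvidia modules one by one, the order can be worked out from the lsmod output itself rather than by trial and error. This is a rough sketch under assumptions: nvidia_unload_order is a hypothetical helper, and it relies on the usual lsmod layout where the fourth column lists the modules that use a given module.

```shell
#!/bin/sh
# Sketch: print nvidia kernel modules in an order that is safe for rmmod,
# i.e. every module is printed after the modules that depend on it.
# nvidia_unload_order is a hypothetical helper; it reads lsmod-style text on stdin.
nvidia_unload_order() {
    awk '
    # Column 1 is the module name, column 4 the comma-separated list of its users.
    $1 ~ /^nvidia/ { users[$1] = $4; pending[$1] = 1; n++ }
    END {
        while (n > 0) {
            found = ""
            for (m in pending) {
                k = split(users[m], u, ",")
                blocked = 0
                # A module is blocked while one of its users is still loaded
                for (i = 1; i <= k; i++) if (u[i] in pending) blocked = 1
                if (!blocked) { found = m; break }
            }
            if (found == "") break   # unexpected input: stop instead of looping forever
            print found
            delete pending[found]
            n--
        }
    }'
}
```

One possible use is lsmod | nvidia_unload_order | while read -r m; do sudo rmmod "$m"; done. rmmod will still refuse a module that something outside this list is using, such as X or gdm.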