Some issues on Linux after CUDA 11.1, LibTorch 1.8.0
I have tested the updates introduced by #294 on two separate Linux machines with NVIDIA GPUs.
Sharing the results here.
This is the test code:
open DiffSharp
open DiffSharp.Util
dsharp.seed(1)
print (dsharp.devices())
dsharp.config(backend=Backend.Torch, device=Device.GPU)
let t = dsharp.tensor([1,2,3])
print t.device
let fwdx = dsharp.tensor([[[ 0.1264; 5.3183; 6.6905; -10.6416];
                           [ 13.8060; 4.5253; 2.8568; -3.2037];
                           [ -0.5796; -2.7937; -3.3662; -1.3017]];
                          [[ -2.8910; 3.9349; -4.3892; -2.6051];
                           [ 4.2547; 2.6049; -9.8226; -5.4543];
                           [ -0.9674; 1.0070; -4.6518; 7.1702]]])
let fwdy = dsharp.tensor([[[ 4.0332e+00; 6.3036e+00];
                           [ 8.4410e+00; -5.7543e+00];
                           [-5.6937e-03; -6.7241e+00]];
                          [[-2.2619e+00; 1.2082e+00];
                           [-1.2203e-01; -4.9373e+00];
                           [-4.1881e+00; -3.4198e+00]]])
let fwdz = dsharp.conv1d(fwdx, fwdy, stride=1)
print fwdz
Machine 1: Ubuntu 20.04.2 LTS, NVIDIA driver 460.39, CUDA 11.2
External libtorch
When I use libtorch 1.8.0+cu111, installed through a regular PyTorch install and loaded with
System.Runtime.InteropServices.NativeLibrary.Load("/home/gunes/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch.so")
things work and I get:
[Device (CPU, -1)]
Device (CUDA, 0)
tensor([[[143.319, 108.034, 11.225],
[-5.90653, 4.60892, 6.02813]],
[[27.3029, 97.9861, -133.838],
[-1.47925, 45.6667, 29.8702]]])
Note that CUDA is not listed in the output of dsharp.devices(), even though t.device reports Device (CUDA, 0).
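The external libtorch run depends on preloading libtorch.so with NativeLibrary.Load, which throws if the file is missing or cannot be loaded. Since the two machines use different site-packages paths (python3.8 on Machine 1, python3.7 on Machine 2), a non-throwing probe can make it clearer which libtorch was actually picked up. A minimal sketch using NativeLibrary.TryLoad; canLoad and tryPreloadLibtorch are hypothetical helper names, and the candidate paths are the ones from this report.

```fsharp
open System
open System.IO
open System.Runtime.InteropServices

/// True if the file exists and the runtime can load it as a native library.
/// NativeLibrary.TryLoad returns false instead of throwing like NativeLibrary.Load.
/// On success the library stays loaded, which is the point of a preload.
let canLoad (path: string) : bool =
    if not (File.Exists path) then
        false
    else
        let loaded, _handle = NativeLibrary.TryLoad(path)
        loaded

/// Hypothetical helper: return the first candidate libtorch.so that loads, if any.
let tryPreloadLibtorch (candidates: string list) : string option =
    List.tryFind canLoad candidates

// Candidate paths taken from the two machines in this report.
let home = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile)
let candidates =
    [ Path.Combine(home, "anaconda3/lib/python3.8/site-packages/torch/lib/libtorch.so")
      Path.Combine(home, "anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so") ]

match tryPreloadLibtorch candidates with
| Some path -> printfn "Preloaded %s" path
| None -> printfn "None of the candidate libtorch.so files could be loaded"
```

This only confirms that libtorch.so itself loads; it says nothing about whether CUDA is initialized afterwards.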
NuGet libtorch
When I use the libtorch NuGet package
#r "nuget: libtorch-cuda-11.1-linux-x64, 1.8.0.7"
things fail:
[Device (CPU, -1)]
System.DllNotFoundException: Unable to load shared library '/home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part4/1.8.0.7/runtimes/linux-x64/native/libtorch_cuda.so' or one of its dependencies. In order to help diagnose loading problems, consider setting the LD_DEBUG environment variable: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
at System.Runtime.InteropServices.NativeLibrary.LoadFromPath(String libraryName, Boolean throwOnError)
at System.Runtime.InteropServices.NativeLibrary.Load(String libraryPath)
at System.Runtime.Loader.AssemblyLoadContext.LoadUnmanagedDllFromPath(String unmanagedDllPath)
at Microsoft.DotNet.DependencyManager.NativeDllResolveHandlerCoreClr._resolveUnmanagedDll(Assembly _arg1, String name) in F:\workspace\_work\1\s\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 92
at <StartupCode$Microsoft-DotNet-DependencyManager>.$NativeDllResolveHandler.-ctor@98-2.Invoke(Assembly delegateArg0, String delegateArg1) in F:\workspace\_work\1\s\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 98
at System.Runtime.Loader.AssemblyLoadContext.GetResolvedUnmanagedDll(Assembly assembly, String unmanagedDllName)
at System.Runtime.Loader.AssemblyLoadContext.ResolveUnmanagedDllUsingEvent(String unmanagedDllName, Assembly assembly, IntPtr gchManagedAssemblyLoadContext)
at System.Runtime.InteropServices.NativeLibrary.LoadByName(String libraryName, QCallAssembly callingAssembly, Boolean hasDllImportSearchPathFlag, UInt32 dllImportSearchPathFlag, Boolean throwOnError)
at System.Runtime.InteropServices.NativeLibrary.LoadLibraryByName(String libraryName, Assembly assembly, Nullable`1 searchPath, Boolean throwOnError)
at System.Runtime.InteropServices.NativeLibrary.TryLoad(String libraryName, Assembly assembly, Nullable`1 searchPath, IntPtr& handle)
at TorchSharp.Torch.LoadNativeBackend(Boolean useCudaBackend)
at TorchSharp.Torch.TryInitializeDeviceType(DeviceType deviceType)
at TorchSharp.Torch.InitializeDeviceType(DeviceType deviceType)
at TorchSharp.Tensor.TorchTensor.to_device(DeviceType deviceType, Int32 deviceIndex)
at DiffSharp.Backends.Torch.Utils.torchMoveTo(TorchTensor tt, Device device) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 69
at DiffSharp.Backends.Torch.TorchTensorOps`2.CreateFromFlatArray(Array values, Int32[] shape, Device device) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 1122
at DiffSharp.Backends.Torch.TorchBackendTensorStatics.CreateFromFlatArray(Array values, Int32[] shape, Dtype dtype, Device device) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 1456
at DiffSharp.Backends.RawTensor.Create(Object values, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/RawTensor.fs:line 231
at DiffSharp.Tensor.create(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 817
at DiffSharp.dsharp.tensor(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 33
at DiffSharp.dsharp.config(FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 1101
at <StartupCode$FSI_0002>.$FSI_0002.main@()
Stopped due to error
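The exception names libtorch_cuda.so, but the file that cannot be opened is its dependency libtorch_cuda_cu.so. Such transitive dependencies are resolved by the Linux dynamic loader itself, not by the .NET resolver. Besides setting LD_DEBUG as the message suggests, running ldd on the library lists every dependency the loader cannot resolve. A minimal sketch, assuming ldd is on PATH; missingDependencies and runLdd are hypothetical helper names.

```fsharp
open System
open System.Diagnostics

/// Names of libraries that ldd reports as unresolved.
/// ldd prints such entries as "\tlibfoo.so => not found".
let missingDependencies (lddOutput: string) : string list =
    lddOutput.Split('\n')
    |> Array.filter (fun line -> line.Contains("=> not found"))
    |> Array.map (fun line -> line.Substring(0, line.IndexOf("=>")).Trim())
    |> Array.toList

/// Hypothetical helper: run "ldd <libraryPath>" and return its standard output.
/// Assumes ldd is on PATH (it ships with glibc on most Linux distributions).
let runLdd (libraryPath: string) : string =
    let psi = ProcessStartInfo("ldd", libraryPath)
    psi.RedirectStandardOutput <- true
    psi.UseShellExecute <- false
    use proc = Process.Start(psi)
    let output = proc.StandardOutput.ReadToEnd()
    proc.WaitForExit()
    output
```

Applying runLdd to the libtorch_cuda.so path from the exception and piping the result through missingDependencies would be expected to list libtorch_cuda_cu.so if that is the library the loader cannot find.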
Machine 2: Ubuntu 18.04.5 LTS, NVIDIA driver 450.102.04, CUDA 11.0
External libtorch
System.Runtime.InteropServices.NativeLibrary.Load("/home/gunes/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so")
[Device (CPU, -1)]
System.DllNotFoundException: Unable to load shared library '/home/gunes/git/github/DiffSharp/DiffSharp/examples/../tests/DiffSharp.Tests/bin/Debug/net5.0/runtimes/linux-x64/native/libLibTorchSharp.so' or one of its dependencies. In order to help diagnose loading problems, consider setting the LD_DEBUG environment variable: /lib/x86_64-linux-gnu/libpthread.so.0: version `GLIBC_2.30' not found (required by /home/gunes/git/github/DiffSharp/DiffSharp/examples/../tests/DiffSharp.Tests/bin/Debug/net5.0/runtimes/linux-x64/native/libLibTorchSharp.so)
at System.Runtime.InteropServices.NativeLibrary.LoadFromPath(String libraryName, Boolean throwOnError)
at System.Runtime.InteropServices.NativeLibrary.Load(String libraryPath)
at System.Runtime.Loader.AssemblyLoadContext.LoadUnmanagedDllFromPath(String unmanagedDllPath)
at Microsoft.DotNet.DependencyManager.NativeDllResolveHandlerCoreClr._resolveUnmanagedDll(Assembly _arg1, String name) in F:\workspace\_work\1\s\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 92
at <StartupCode$Microsoft-DotNet-DependencyManager>.$NativeDllResolveHandler.-ctor@98-2.Invoke(Assembly delegateArg0, String delegateArg1) in F:\workspace\_work\1\s\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 98
at System.Runtime.Loader.AssemblyLoadContext.GetResolvedUnmanagedDll(Assembly assembly, String unmanagedDllName)
at System.Runtime.Loader.AssemblyLoadContext.ResolveUnmanagedDllUsingEvent(String unmanagedDllName, Assembly assembly, IntPtr gchManagedAssemblyLoadContext)
at TorchSharp.Tensor.Float32Tensor.THSTensor_newFloat32Scalar(Single scalar, Int32 deviceType, Int32 deviceIndex, Boolean requiresGrad)
at TorchSharp.Tensor.Float32Tensor.from(Single scalar, DeviceType deviceType, Int32 deviceIndex, Boolean requiresGrad)
at <StartupCode$DiffSharp-Backends-Torch>.$Torch.RawTensor.-ctor@1128-9.Invoke(Single v) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 1128
at DiffSharp.Backends.Torch.TorchTensorOps`2.CreateFromFlatArray(Array values, Int32[] shape, Device device) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 1120
at DiffSharp.Backends.Torch.TorchBackendTensorStatics.CreateFromFlatArray(Array values, Int32[] shape, Dtype dtype, Device device) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 1456
at DiffSharp.Backends.RawTensor.Create(Object values, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/RawTensor.fs:line 231
at DiffSharp.Tensor.create(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 817
at DiffSharp.dsharp.tensor(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 33
at DiffSharp.dsharp.config(FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 1101
at <StartupCode$FSI_0001>.$FSI_0001.main@()
Stopped due to error
NuGet libtorch
#r "nuget: libtorch-cuda-11.1-linux-x64, 1.8.0.7"
This fails with the following output. The main problem seems to be the missing GLIBC_2.30.
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment1/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment1
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment2/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment2
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment3/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment3
Writing restored primary file at /tmp/tmplBc0AP.tmp
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-primary/1.8.0.7/runtimes/linux-x64/native/libtorch_cuda_cu.so
Moving /tmp/tmplBc0AP.tmp --> /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-primary/1.8.0.7/runtimes/linux-x64/native/libtorch_cuda_cu.so
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment1/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment1
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment2/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment2
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment3/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment3
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment4/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment4
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment5/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment5
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment6/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment6
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment7/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment7
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part3-fragment8/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cu.so.fragment8
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment1/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment1
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment2/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment2
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment3/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment3
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment4/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment4
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment5/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment5
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment6/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment6
Found fragment file at /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment7/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment7
Writing restored primary file at /tmp/tmpTrOUXw.tmp
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-primary/1.8.0.7/runtimes/linux-x64/native/libtorch_cuda_cpp.so
Moving /tmp/tmpTrOUXw.tmp --> /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-primary/1.8.0.7/runtimes/linux-x64/native/libtorch_cuda_cpp.so
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment1/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment1
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment2/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment2
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment3/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment3
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment4/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment4
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment5/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment5
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment6/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment6
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment7/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment7
Deleting /home/gunes/.nuget/packages/libtorch-cuda-11.1-linux-x64-part2-fragment8/1.8.0.7/fragments/linux-x64/native/libtorch_cuda_cpp.so.fragment8
[Device (CPU, -1)]
System.DllNotFoundException: Unable to load shared library '/home/gunes/git/github/DiffSharp/DiffSharp/examples/../tests/DiffSharp.Tests/bin/Debug/net5.0/runtimes/linux-x64/native/libLibTorchSharp.so' or one of its dependencies. In order to help diagnose loading problems, consider setting the LD_DEBUG environment variable: /lib/x86_64-linux-gnu/libpthread.so.0: version `GLIBC_2.30' not found (required by /home/gunes/git/github/DiffSharp/DiffSharp/examples/../tests/DiffSharp.Tests/bin/Debug/net5.0/runtimes/linux-x64/native/libLibTorchSharp.so)
at System.Runtime.InteropServices.NativeLibrary.LoadFromPath(String libraryName, Boolean throwOnError)
at System.Runtime.InteropServices.NativeLibrary.Load(String libraryPath)
at System.Runtime.Loader.AssemblyLoadContext.LoadUnmanagedDllFromPath(String unmanagedDllPath)
at Microsoft.DotNet.DependencyManager.NativeDllResolveHandlerCoreClr._resolveUnmanagedDll(Assembly _arg1, String name) in F:\workspace\_work\1\s\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 92
at <StartupCode$Microsoft-DotNet-DependencyManager>.$NativeDllResolveHandler.-ctor@98-2.Invoke(Assembly delegateArg0, String delegateArg1) in F:\workspace\_work\1\s\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 98
at System.Runtime.Loader.AssemblyLoadContext.GetResolvedUnmanagedDll(Assembly assembly, String unmanagedDllName)
at System.Runtime.Loader.AssemblyLoadContext.ResolveUnmanagedDllUsingEvent(String unmanagedDllName, Assembly assembly, IntPtr gchManagedAssemblyLoadContext)
at TorchSharp.Tensor.Float32Tensor.THSTensor_newFloat32Scalar(Single scalar, Int32 deviceType, Int32 deviceIndex, Boolean requiresGrad)
at TorchSharp.Tensor.Float32Tensor.from(Single scalar, DeviceType deviceType, Int32 deviceIndex, Boolean requiresGrad)
at <StartupCode$DiffSharp-Backends-Torch>.$Torch.RawTensor.-ctor@1128-9.Invoke(Single v) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 1128
at DiffSharp.Backends.Torch.TorchTensorOps`2.CreateFromFlatArray(Array values, Int32[] shape, Device device) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 1120
at DiffSharp.Backends.Torch.TorchBackendTensorStatics.CreateFromFlatArray(Array values, Int32[] shape, Dtype dtype, Device device) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Backends.Torch/Torch.RawTensor.fs:line 1456
at DiffSharp.Backends.RawTensor.Create(Object values, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/RawTensor.fs:line 231
at DiffSharp.Tensor.create(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/Tensor.fs:line 817
at DiffSharp.dsharp.tensor(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 33
at DiffSharp.dsharp.config(FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend) in /home/gunes/git/github/DiffSharp/DiffSharp/src/DiffSharp.Core/DiffSharp.fs:line 1101
at <StartupCode$FSI_0002>.$FSI_0002.main@()
Stopped due to error
So on the first machine I couldn’t get the NuGet libtorch working; the problem seems to be that libtorch_cuda_cu.so cannot be found. Interestingly, dsharp.devices() also doesn’t list CUDA as available even when CUDA is working. This might need a separate issue of its own.
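The restore log from Machine 2 shows libtorch_cuda_cu.so being reassembled into the libtorch-cuda-11.1-linux-x64-part3-primary package folder, while the exception on Machine 1 loads libtorch_cuda.so from the libtorch-cuda-11.1-linux-x64-part4 folder. One possible explanation (a hypothesis the logs do not confirm) is that the loader only looks next to libtorch_cuda.so and on its usual search path, so a copy in a sibling package folder is not found unless it is pointed there, for example through LD_LIBRARY_PATH. A minimal sketch to list where each library lives in the NuGet cache; findLibrary is a hypothetical helper, and ~/.nuget/packages is the cache location seen in the logs.

```fsharp
open System
open System.IO

/// Hypothetical helper: list every file named exactly fileName under root
/// (recursively), skipping directories that cannot be read.
let findLibrary (root: string) (fileName: string) : string list =
    if not (Directory.Exists root) then
        []
    else
        let options = EnumerationOptions(RecurseSubdirectories = true, IgnoreInaccessible = true)
        Directory.EnumerateFiles(root, fileName, options)
        |> Seq.sort
        |> Seq.toList

// Default NuGet package cache, matching the ~/.nuget/packages paths in the logs.
let nugetCache =
    Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.UserProfile), ".nuget", "packages")

// Print the folder holding each library so the two locations can be compared.
for name in [ "libtorch_cuda.so"; "libtorch_cuda_cu.so" ] do
    for path in findLibrary nugetCache name do
        printfn "%s -> %s" name (Path.GetDirectoryName path)
```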
On the second machine I couldn’t get either the NuGet or the external libtorch working. The problem seems to be the missing GLIBC_2.30, and probably also having CUDA 11.0 installed instead of 11.1.
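The GLIBC_2.30 error means libLibTorchSharp.so requires a newer glibc than the one installed. Ubuntu 18.04 ships glibc 2.27 and Ubuntu 20.04 ships glibc 2.31, which is consistent with only Machine 2 hitting this error. A minimal sketch to check the installed glibc from F#, assuming a glibc-based Linux where libc.so.6 exports gnu_get_libc_version; installedGlibc and meetsGlibc are hypothetical helper names.

```fsharp
open System
open System.Runtime.InteropServices

/// glibc's gnu_get_libc_version() returns a pointer to a static string such as "2.27".
[<DllImport("libc.so.6", CallingConvention = CallingConvention.Cdecl)>]
extern nativeint gnu_get_libc_version()

/// Installed glibc version as a string (glibc-based Linux only).
let installedGlibc () : string =
    Marshal.PtrToStringAnsi(gnu_get_libc_version())

/// True when installed is at least required, compared numerically ("2.27" < "2.30").
let meetsGlibc (installed: string) (required: string) : bool =
    Version.Parse(installed.Trim()) >= Version.Parse(required.Trim())

if RuntimeInformation.IsOSPlatform(OSPlatform.Linux) then
    try
        let v = installedGlibc ()
        printfn "glibc %s; meets GLIBC_2.30: %b" v (meetsGlibc v "2.30")
    with
    | :? DllNotFoundException
    | :? EntryPointNotFoundException -> printfn "glibc not detected (non-glibc system?)"
```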
It would be good to focus on the first machine and understand:
- why the NuGet libtorch didn’t work, and
- why dsharp.devices() doesn’t show CUDA with the external libtorch, even though creating CUDA tensors raises no errors (I wonder whether the tensors are really created on CUDA or whether it is silently falling back to CPU).
Both machines have dotnet sdk 5.0.201.
Issue created 2 years ago; 13 comments (5 by maintainers).
Top GitHub Comments
It looks like dsharp.devices() doesn’t report the CUDA devices until the device is initialized.
There are two things discussed in this issue.
I think it makes sense to close this issue in favor of #271, where we identified libtorch loading and device initialization issues connected to TorchSharp (https://github.com/dotnet/TorchSharp/issues/345), and in favor of #304, where some solutions to the API design are listed.