NCCL WARN Failed to open libibverbs.so[.1]

See the original GitHub issue

I used the official example at https://github.com/kubeflow/tf-operator/tree/master/examples/v1/distribution_strategy/keras-API and hit two problems:

1. The pod is working, but only one GPU is used and the second fails.
2. The logs repeat the warning: `eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
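
For context, the example trains a Keras model under a multi-worker distribution strategy that reads the TF_CONFIG environment variable tf-operator injects into each worker pod. Below is a minimal sketch of that wiring, assuming the TF 2.x experimental API; it is not a verbatim copy of the repo's script, and the model is simplified:

import json
import os

import tensorflow as tf

# tf-operator injects TF_CONFIG into every worker pod, e.g.:
# {"cluster": {"worker": ["multi-worker-worker-0.kubeflow.svc:2222",
#                         "multi-worker-worker-1.kubeflow.svc:2222",
#                         "multi-worker-worker-2.kubeflow.svc:2222"]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
num_workers = max(1, len(tf_config.get("cluster", {}).get("worker", [])))

# The strategy resolves TF_CONFIG itself; each pod sees only the GPU(s)
# on its own node, and gradients are all-reduced across pods via NCCL
# (falling back to TCP sockets when InfiniBand is unavailable).
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

GLOBAL_BATCH_SIZE = 64 * num_workers  # scale the batch with the ring size

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(),
        metrics=["accuracy"])

Since each node here has a single GTX 1060/1070, one visible GPU per pod is expected; the three GPUs only cooperate through the NCCL ring that shows up later in the log (rank 0, nranks 3).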

System information

  • Ubuntu 16.04
  • 1 master   IP: 14X.XXX.XXX.1
  • node1      IP: 14X.XXX.XXX.8    GTX 1060
  • node2      IP: 14X.XXX.XXX.9    GTX 1060
  • node3      IP: 14X.XXX.XXX.10   GTX 1070
  • Docker 18.09.7-3
  • CUDA 10.0
  • nvidia-container-runtime 2.0.0
  • Kubernetes 1.5.7
  • Kubeflow 1.01

These are the pod logs:

2020-06-06 13:42:10.403622: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-06-06 13:42:10.405097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-06-06 13:42:11.360954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-06 13:42:11.392254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:05:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-06-06 13:42:11.392302: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-06 13:42:11.392337: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-06 13:42:11.394046: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-06 13:42:11.394338: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-06 13:42:11.396368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-06 13:42:11.397503: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-06 13:42:11.397553: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-06 13:42:11.399221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-06 13:42:11.399588: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-06 13:42:11.407640: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2198720000 Hz
2020-06-06 13:42:11.409315: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4216610 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-06 13:42:11.409351: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-06 13:42:11.506014: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4a3ed20 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-06 13:42:11.506070: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1
2020-06-06 13:42:11.507576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:05:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-06-06 13:42:11.507666: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-06 13:42:11.507688: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-06 13:42:11.507725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-06 13:42:11.507748: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-06 13:42:11.507782: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-06 13:42:11.507812: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-06 13:42:11.507832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-06 13:42:11.510127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-06 13:42:11.510187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-06 13:42:11.871917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-06 13:42:11.871965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-06-06 13:42:11.871973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-06-06 13:42:11.873633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7169 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-06 13:42:11.877266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:05:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7715GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-06-06 13:42:11.877328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-06 13:42:11.877352: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-06 13:42:11.877384: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-06 13:42:11.877416: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-06 13:42:11.877439: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-06 13:42:11.877466: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-06 13:42:11.877493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-06 13:42:11.879798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-06-06 13:42:11.879840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-06 13:42:11.879856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-06-06 13:42:11.879866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-06-06 13:42:11.882065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 7169 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-06 13:42:11.890039: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222, 1 -> multi-worker-worker-1.kubeflow.svc:2222, 2 -> multi-worker-worker-2.kubeflow.svc:2222}
2020-06-06 13:42:11.891220: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
WARNING:absl:Dataset mnist is hosted on GCS. It will automatically be downloaded to your
local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead pass
`try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.

multi-worker-worker-0:1:210 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.3.184<0>
multi-worker-worker-0:1:210 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

multi-worker-worker-0:1:210 [0] external/nccl_archive/src/misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
2020-06-06 13:42:25.072418: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-06 13:42:25.854184: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
NCCL version 2.4.7+cudaCUDA_MAJOR.CUDA_MINOR
multi-worker-worker-0:1:292 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
multi-worker-worker-0:1:292 [0] NCCL INFO Could not find real path of /sys/class/net/eth0/device
multi-worker-worker-0:1:292 [0] NCCL INFO bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:24 -> 2
multi-worker-worker-0:1:292 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  SYS
multi-worker-worker-0:1:292 [0] NCCL INFO Channel 00 :    0   1   2
multi-worker-worker-0:1:292 [0] NCCL INFO Could not find real path of /sys/class/net/eth0/device
multi-worker-worker-0:1:292 [0] NCCL INFO bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:24 -> 2
multi-worker-worker-0:1:292 [0] NCCL INFO Ring 00 : 2 -> 0 [receive] via NET/Socket/0
multi-worker-worker-0:1:292 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
multi-worker-worker-0:1:292 [0] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
multi-worker-worker-0:1:292 [0] NCCL INFO comm 0x7f5c38305e70 rank 0 nranks 3 cudaDev 0 nvmlDev 0 - Init COMPLETE
multi-worker-worker-0:1:291 [0] NCCL INFO Launch mode Parallel
2020-06-06 13:42:27.385638: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.
2020-06-06 13:42:27.385715: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1259] Profiler found 1 GPUs
2020-06-06 13:42:27.387084: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
2020-06-06 13:42:27.487586: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1307] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-06-06 13:42:27.488406: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1346] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
Downloading and preparing dataset mnist/3.0.1 (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...
Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten (Flatten)            (None, 576)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                36928     
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650       
=================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
_________________________________________________________________
Train for 70 steps
Epoch 1/10
2020-06-06 13:42:27.521448: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1329] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-06-06 13:42:27.521483: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:88]  GpuTracer has collected 0 callback api events and 0 activity events.
67/70 [===========================>..] - ETA: 0s - loss: 0.8933 - accuracy: 0.7327  
Learning rate for epoch 1 is 0.0010000000474974513
70/70 [==============================] - 9s 123ms/step - loss: 0.8707 - accuracy: 0.7398
Epoch 2/10
66/70 [===========================>..] - ETA: 0s - loss: 0.2171 - accuracy: 0.9326
Learning rate for epoch 2 is 0.0010000000474974513
70/70 [==============================] - 3s 44ms/step - loss: 0.2138 - accuracy: 0.9333
Epoch 3/10
67/70 [===========================>..] - ETA: 0s - loss: 0.1507 - accuracy: 0.9527
Learning rate for epoch 3 is 0.0010000000474974513
70/70 [==============================] - 3s 49ms/step - loss: 0.1503 - accuracy: 0.9530
Epoch 4/10
65/70 [==========================>...] - ETA: 0s - loss: 0.1092 - accuracy: 0.9683
Learning rate for epoch 4 is 9.999999747378752e-05
70/70 [==============================] - 3s 48ms/step - loss: 0.1098 - accuracy: 0.9685
Epoch 5/10
67/70 [===========================>..] - ETA: 0s - loss: 0.1067 - accuracy: 0.9702
Learning rate for epoch 5 is 9.999999747378752e-05
70/70 [==============================] - 4s 52ms/step - loss: 0.1068 - accuracy: 0.9702
Epoch 6/10
69/70 [============================>.] - ETA: 0s - loss: 0.0914 - accuracy: 0.9709
Learning rate for epoch 6 is 9.999999747378752e-05
70/70 [==============================] - 4s 52ms/step - loss: 0.0909 - accuracy: 0.9711
Epoch 7/10
67/70 [===========================>..] - ETA: 0s - loss: 0.0889 - accuracy: 0.9722
Learning rate for epoch 7 is 9.999999747378752e-05
70/70 [==============================] - 4s 52ms/step - loss: 0.0882 - accuracy: 0.9724
Epoch 8/10
67/70 [===========================>..] - ETA: 0s - loss: 0.0907 - accuracy: 0.9732
Learning rate for epoch 8 is 9.999999747378752e-06
70/70 [==============================] - 4s 52ms/step - loss: 0.0908 - accuracy: 0.9730
Epoch 9/10
66/70 [===========================>..] - ETA: 0s - loss: 0.0968 - accuracy: 0.9704
Learning rate for epoch 9 is 9.999999747378752e-06
70/70 [==============================] - 4s 54ms/step - loss: 0.0955 - accuracy: 0.9710
Epoch 10/10
69/70 [============================>.] - ETA: 0s - loss: 0.0858 - accuracy: 0.9736
Learning rate for epoch 10 is 9.999999747378752e-06
70/70 [==============================] - 4s 53ms/step - loss: 0.0860 - accuracy: 0.9736
2020-06-06 13:43:03.252370: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2020-06-06 13:43:03.625375: W tensorflow/python/util/util.cc:319] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1786: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1786: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-06-06 13:43:05.762394: W tensorflow/core/common_runtime/eager/context.cc:349] Unable to destroy server_ object, so releasing instead. Servers don't support clean shutdown.
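
A note on the warning in the title: the libibverbs message only means that NCCL could not dlopen the InfiniBand verbs library, so it falls back to its plain TCP transport, which is exactly what the "NET/Socket" and "via NET/Socket/0" lines above show. The ring still initializes ("Init COMPLETE") and all ten epochs run, so on hardware without InfiniBand the warning is harmless, just slower than RDMA would be. A quick way to check whether the library is visible inside the container is a plain dlopen probe (a minimal sketch; NCCL itself does the equivalent in ibvwrap.cc):

import ctypes

# NCCL dlopens libibverbs.so (or the .so.1 alias); if neither loads,
# it prints "Failed to open libibverbs.so[.1]" and uses its socket
# transport instead of RDMA.
for name in ("libibverbs.so.1", "libibverbs.so"):
    try:
        ctypes.CDLL(name)
        print(name, "loaded")
        break
    except OSError:
        print(name, "not found")

Installing the distribution's verbs package (e.g. libibverbs1 on Ubuntu) silences the warning, but NCCL will still use sockets unless real IB devices are present.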

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 20 (9 by maintainers)

Top GitHub Comments

3 reactions
gaocegege commented, Jun 10, 2020

Hi, did you solve the problem?

0 reactions
LearnKen commented, Jun 8, 2020

It's training, but it's still using only one GPU. Why…

Read more comments on GitHub >

Top Results From Across the Web

NCCL WARN Failed to open libibverbs.so[.1] #12219
Trainer( gpus=[0,1], strategy='ddp', .... When I try to train it just stops. So set env ...
Read more >
A6000 NCCL WARN Failed to open libibverbs.so - distributed
I get the error misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]. I just upgraded my 2x (Titan RTX with nvlink) to 2x (A600 ......
Read more >
How (Not) To Scale Deep Learning in 6 Easy Steps
This notebooks walks through training an image classifier on the Caltech 256 dataset in a way that illustrates pitfalls and their solutions, and...
Read more >
SDS - LaMaStEx
This function takes in rank and size arguments so it can be used for ... misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1] [1 ......
Read more >
Proxy Thread Error in NCCL after building from source ...
You received this message because you are subscribed to the Google Groups "Discuss" group. To unsubscribe from this group and stop receiving emails...
Read more >
