DXGI_ERROR_DEVICE_REMOVED Error

Hello, I have a problem, and I don’t know whether anyone else has run into it. I have a Vega 8 with the drivers all installed correctly, but I get the error DXGI_ERROR_DEVICE_REMOVED when I try to run the following script.

import tensorflow.compat.v1 as tf
tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
print(tf.add([1.0, 2.0], [3.0, 4.0]))

I’ve already followed the instructions at https://aka.ms/tfdmltimeout, but the error persists.

2021-03-31 11:29:36.810513: I tensorflow/stream_executor/platform/default/dso_loader.cc:98] Successfully opened dynamic library C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python/directml.bdb07c797e1e1af1b4a42d21c67ce5494d73991459.dll
2021-03-31 11:29:36.917148: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:126] DirectML device enumeration: found 1 compatible adapters.
[PhysicalDevice(name='/physical_device:DML:0', device_type='DML')]
2021-03-31 11:29:36.920996: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-03-31 11:29:36.925428: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:109] DirectML: creating device on adapter 0 (AMD Radeon(TM) Vega 8 Graphics)
2021-03-31 11:29:37.129830: E tensorflow/core/common_runtime/dml/dml_heap_allocator.cc:53] The DirectML device has encountered an unrecoverable error (DXGI_ERROR_DEVICE_REMOVED). This is most often caused by a timeout occurring on the GPU. Please visit https://aka.ms/tfdmltimeout for more information and troubleshooting steps.
2021-03-31 11:29:37.136448: F tensorflow/core/common_runtime/dml/dml_heap_allocator.cc:53] HRESULT failed with 0x887a0005: hr
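
For anyone reading the last log line: the raw value 0x887a0005 is the numeric form of DXGI_ERROR_DEVICE_REMOVED, so both lines report the same failure. Below is a minimal sketch of how such a value can be pulled out of a log line and split into its HRESULT fields. The function names `hresult_from_log` and `decode_hresult` are hypothetical helpers written for illustration, and the table covers only a handful of DXGI codes from the Windows SDK headers.

```python
import re

# Small, non-exhaustive subset of DXGI error codes from the Windows SDK headers.
DXGI_ERRORS = {
    0x887A0001: "DXGI_ERROR_INVALID_CALL",
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
    0x887A0020: "DXGI_ERROR_DRIVER_INTERNAL_ERROR",
}


def hresult_from_log(line):
    """Return the first 8-digit hex value in a log line as an int, or None."""
    match = re.search(r"0x([0-9a-fA-F]{8})\b", line)
    return int(match.group(1), 16) if match else None


def decode_hresult(hr):
    """Split an HRESULT into failure bit, facility, code, and a known name."""
    hr &= 0xFFFFFFFF  # accept signed 32-bit values as well
    return {
        "failed": bool(hr >> 31),           # top bit set means failure
        "facility": (hr >> 16) & 0x1FFF,    # same mask as the SDK's HRESULT_FACILITY macro
        "code": hr & 0xFFFF,
        "name": DXGI_ERRORS.get(hr, "unknown"),
    }


if __name__ == "__main__":
    line = "HRESULT failed with 0x887a0005: hr"
    print(decode_hresult(hresult_from_log(line)))
```

The decoded name only confirms which error was raised. It says nothing about why the device was removed, which is what the troubleshooting link in the log is for.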

I think this is the same problem I hit when I try to run:

python detect_video.py --video data/grca-trainmix_1280x720.mp4 --trace --max_frames 10 --headless

WARNING:tensorflow:From detect_video.py:39: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

W0331 13:51:28.546197  3820 module_wrapper.py:139] From detect_video.py:39: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

2021-03-31 13:51:28.806023: I tensorflow/stream_executor/platform/default/dso_loader.cc:98] Successfully opened dynamic library C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python/directml.bdb07c797e1af1b4a42d21c67ce5494d73991459.dll
2021-03-31 13:51:28.933164: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:126] DirectML device enumeration: found 1 compatible adapters.
2021-03-31 13:51:28.936741: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2021-03-31 13:51:28.940855: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:109] DirectML: creating device on adapter 0 (AMD Radeon(TM) Vega 8 Graphics)
WARNING:tensorflow:From detect_video.py:46: The name tf.RunOptions is deprecated. Please use tf.compat.v1.RunOptions instead.

W0331 13:51:29.155223  3820 module_wrapper.py:139] From detect_video.py:46: The name tf.RunOptions is deprecated. Please use tf.compat.v1.RunOptions instead.

WARNING:tensorflow:From C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0331 13:51:29.190702  3820 deprecation.py:506] From C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Traceback (most recent call last):
  File "detect_video.py", line 148, in <module>
    app.run(main)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\absl\app.py", line 303, in run
    _run_main(main, args)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "detect_video.py", line 65, in main
    yolo.load_weights(FLAGS.weights)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 182, in load_weights
    return super(Model, self).load_weights(filepath, by_name)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 1339, in load_weights
    pywrap_tensorflow.NewCheckpointReader(filepath)
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 877, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "C:\Users\d.belgd\Miniconda3\envs\directml2\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 889, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./checkpoints/yolov3.tf: Not found: FindFirstFile failed for: ./checkpoints : The system cannot find the path specified.
; No such process
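
Note that this second traceback stops inside load_weights because ./checkpoints cannot be found, which is a file-system error and may be unrelated to the DirectML failure above. Here is a small sketch for checking a checkpoint prefix before calling load_weights. It assumes the usual TensorFlow checkpoint layout of a `<prefix>.index` file plus one or more `<prefix>.data-*` shards. `checkpoint_files` and `describe_checkpoint` are hypothetical helper names, not part of TensorFlow or of detect_video.py.

```python
import glob
import os


def checkpoint_files(prefix):
    """List the on-disk files that belong to a TensorFlow checkpoint prefix."""
    escaped = glob.escape(prefix)
    return sorted(glob.glob(escaped + ".index") + glob.glob(escaped + ".data-*"))


def describe_checkpoint(prefix):
    """Explain in one line why a checkpoint prefix can or cannot be loaded."""
    directory = os.path.dirname(prefix) or "."
    if not os.path.isdir(directory):
        # This is the case the traceback reports for ./checkpoints.
        return f"directory not found: {os.path.abspath(directory)}"
    if not os.path.isfile(prefix + ".index"):
        return f"no index file: {prefix}.index"
    return f"ok: {len(checkpoint_files(prefix))} file(s)"


if __name__ == "__main__":
    # Path taken from the traceback; it is resolved against the current working directory.
    print(describe_checkpoint("./checkpoints/yolov3.tf"))
```

Because the path is relative, the result depends on the directory the script is launched from, so it is worth running the check from the same directory as detect_video.py.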

Top GitHub Comments

jstoecker commented, Apr 1, 2021

In short: yes, DirectML supports access to dedicated memory!

DirectML itself doesn’t allocate memory for GPU resources: that’s up to the application/framework using it, such as TensorFlow-DirectML (TFDML) in this case. TFDML has a number of allocators for different purposes, but the bulk of the memory (to store the tensors used in GPU calculations) will be backed by subregions of a so-called default heap. Default heaps reflect different memory pools based on the GPU architecture (UMA or NUMA/discrete).

Your Radeon Vega 8 is an integrated GPU, so the 2GB of dedicated memory you see isn’t physical VRAM but rather reserved system memory. In other words, your system actually has 8GB of RAM, but the integrated GPU is claiming 2GB of it for exclusive access. This blog explains some of the differences between dedicated and shared memory, how they are reported in task manager, and some differences between discrete and integrated GPUs in this respect.

Integrated GPUs are, unfortunately, not going to be particularly fast in machine learning. It’s worth pointing out that we haven’t really optimized TFDML for integrated GPUs (e.g. we could avoid some memory copies since default-heap resources will always live in the “L0” memory pool); however, it’s unlikely that you’ll see huge performance gains over the CPU without using a more powerful discrete GPU.
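
Whether the integrated GPU actually beats the CPU for a given model is easy to measure rather than guess. The following is a rough timing sketch, not a TFDML benchmark: `median_seconds` and `speedup` are hypothetical helpers, and the two workloads are plain-Python stand-ins. In a real comparison each callable would run the same TensorFlow operation, pinned to the CPU in one case and to the DirectML device in the other (for example with a `tf.device(...)` scope, using whatever device names your build reports).

```python
import statistics
import time


def median_seconds(fn, repeats=5, warmup=1):
    """Return the median wall-clock time of fn() over `repeats` runs."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time costs such as device setup
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_fn, candidate_fn, repeats=5):
    """Ratio of baseline time to candidate time; above 1 means the candidate is faster."""
    return median_seconds(baseline_fn, repeats) / median_seconds(candidate_fn, repeats)


if __name__ == "__main__":
    # Stand-in workloads (assumptions for illustration only).
    def cpu_workload():
        sum(i * i for i in range(200_000))

    def gpu_workload():
        sum(i * i for i in range(100_000))

    print(f"speedup: {speedup(cpu_workload, gpu_workload):.2f}x")
```

Using the median over several runs, after a warm-up call, keeps a single slow first iteration from dominating the result.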

douglastehling commented, Apr 1, 2021

@jstoecker and @adtsai, the memory allocation change really did work. One thing I noticed, though, is that detect_video.py is using shared memory and not dedicated memory. Do you know whether DirectML supports access to dedicated memory? I ask because object detection is very slow.
