Add section on loss of device troubleshooting
It would be great to get a section on how to troubleshoot a device lost error. This is an issue that stumps many people when working with Vulkan for the first time. Some things that could be included that would be helpful:
- How to use `vkCmdSetCheckpointNV` / `vkGetQueueCheckpointDataNV` / `vkGetQueueCheckpointData2NV` for NVIDIA
- How to use `vkCmdWriteBufferMarkerAMD` / `vkCmdWriteBufferMarker2AMD` for AMD (I’m not even sure how to use these, I have only used the NVIDIA ones)
Useful link: https://vulkan.lunarg.com/doc/view/1.2.162.1/mac/chunked_spec/chap45.html#_device_loss_debugging
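For the AMD functions listed above, `vkCmdWriteBufferMarkerAMD` writes a 32-bit marker value into a buffer once the preceding commands have reached a given pipeline stage. One possible pattern (an assumption, not something tested in this issue) is to write an increasing marker at the top-of-pipe and bottom-of-pipe stages around each command, then read both values back after the device is lost. The sketch below covers only the interpretation step in plain Java: it makes no Vulkan calls, the two marker values stand in for a buffer read-back, and `BufferMarkerReport` and `suspects` are hypothetical names.

```java
import java.util.List;

/** Hypothetical helper: narrows down suspect commands from two marker values read back from a buffer. */
final class BufferMarkerReport {
    private BufferMarkerReport() {}

    /**
     * Assumes each marked command i (in recording order) wrote the value i + 1,
     * once at the top-of-pipe stage and once at the bottom-of-pipe stage.
     *
     * @param labels       one label per marked command, in recording order
     * @param topMarker    last value written at top of pipe (0 = nothing written)
     * @param bottomMarker last value written at bottom of pipe (0 = nothing written)
     * @return labels of commands that started but did not finish
     */
    static List<String> suspects(List<String> labels, int topMarker, int bottomMarker) {
        int size = labels.size();
        // Clamp both values so a corrupted read-back cannot throw.
        int from = Math.max(0, Math.min(bottomMarker, size));
        int to = Math.max(from, Math.min(topMarker, size));
        // Markers bottomMarker + 1 .. topMarker map to list indices bottomMarker .. topMarker - 1.
        return labels.subList(from, to);
    }
}
```

If the top-of-pipe value is 3 and the bottom-of-pipe value is 1, commands 2 and 3 had started but not completed, so they are the first places to look.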
Here is a sample snippet of an implementation I hacked together for the project I am working on. It does not work perfectly, especially on older cards (I suspect that older drivers have issues with setting checkpoints). Perhaps it may be of use in getting something put together for the book.
public static String[] getOptionalDebuggingDeviceExtensions() {
    return debugCommandBuffers ?
        new String[] {
            //NVIDIA: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_NV_device_diagnostic_checkpoints.html
            VK_NV_DEVICE_DIAGNOSTIC_CHECKPOINTS_EXTENSION_NAME,
            //AMD: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_AMD_buffer_marker.html
            //TODO: VK_AMD_BUFFER_MARKER_EXTENSION_NAME,
            //Optional: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_synchronization2.html
            VK_KHR_SYNCHRONIZATION_2_EXTENSION_NAME, //Disable for nvidia NSIGHT
        }
        : new String[0];
}
public static void debugLostDeviceNV2(Vk12Queue queue) {
    if (!debugCommandBuffers) {
        return;
    }
    if (queue.getDevice().getExtensions().contains(VK_NV_DEVICE_DIAGNOSTIC_CHECKPOINTS_EXTENSION_NAME) && queue.getDevice().getExtensions().contains(VK_KHR_SYNCHRONIZATION_2_EXTENSION_NAME)) {
        try (MemoryStack stack = MemoryStack.stackPush()) {
            //First call only queries how many checkpoints are available
            IntBuffer count = stack.callocInt(1);
            vkGetQueueCheckpointData2NV(queue.getQueue(), count, null);
            if (count.get(0) > 0) {
                VkCheckpointData2NV.Buffer checkpointData = VkCheckpointData2NV.calloc(count.get(0));
                //The spec requires sType to be set on every element before the second call
                checkpointData.forEach(checkpoint -> checkpoint.sType(VK_STRUCTURE_TYPE_CHECKPOINT_DATA_2_NV));
                vkGetQueueCheckpointData2NV(queue.getQueue(), count, checkpointData);
                List<String> checkpoints = new ArrayList<>(count.get(0));
                for (VkCheckpointData2NV checkpoint : checkpointData) {
                    //Markers are stored as list index + 1, see debugCommandBuffer
                    checkpoints.add(commandBufferCheckpoints.get((int) checkpoint.pCheckpointMarker() - 1));
                }
                LOGGER.severe("Device lost on command buffer checkpoint: \n\t* " + String.join("\n\t* ", checkpoints));
            } else {
                debugLostDeviceNV(queue);
            }
        }
    } else {
        debugLostDeviceNV(queue);
    }
}
public static void debugLostDeviceNV(Vk12Queue queue) {
    if (!debugCommandBuffers) {
        return;
    }
    if (queue.getDevice().getExtensions().contains(VK_NV_DEVICE_DIAGNOSTIC_CHECKPOINTS_EXTENSION_NAME)) {
        try (MemoryStack stack = MemoryStack.stackPush()) {
            //First call only queries how many checkpoints are available
            IntBuffer count = stack.callocInt(1);
            vkGetQueueCheckpointDataNV(queue.getQueue(), count, null);
            if (count.get(0) > 0) {
                VkCheckpointDataNV.Buffer checkpointData = VkCheckpointDataNV.calloc(count.get(0));
                //The spec requires sType to be set on every element before the second call
                checkpointData.forEach(checkpoint -> checkpoint.sType(VK_STRUCTURE_TYPE_CHECKPOINT_DATA_NV));
                vkGetQueueCheckpointDataNV(queue.getQueue(), count, checkpointData);
                List<String> checkpoints = new ArrayList<>(count.get(0));
                for (VkCheckpointDataNV checkpoint : checkpointData) {
                    //Markers are stored as list index + 1, see debugCommandBuffer
                    checkpoints.add(commandBufferCheckpoints.get((int) checkpoint.pCheckpointMarker() - 1));
                }
                LOGGER.severe("Device lost on command buffer checkpoint: \n\t* " + String.join("\n\t* ", checkpoints));
            } else {
                LOGGER.warning("No command buffer checkpoints recorded for device lost error.");
            }
        }
    } else {
        debugLostDeviceAMD(queue);
    }
}
public static void debugLostDeviceAMD(Vk12Queue queue) {
    if (!debugCommandBuffers) {
        return;
    }
    if (queue.getDevice().getExtensions().contains(VK_AMD_BUFFER_MARKER_EXTENSION_NAME)) {
        //TODO: ???
    } else {
        LOGGER.warning("Cannot debug lost device, missing device extensions.");
    }
}
public static void debugCommandBuffer(Vk12CommandBuffer buffer, String commandBufferCheckpoint) {
    if (!debugCommandBuffers) {
        return;
    }
    if (buffer.getCommandPool().getDevice().getExtensions().contains(VK_NV_DEVICE_DIAGNOSTIC_CHECKPOINTS_EXTENSION_NAME)) {
        //Register each distinct checkpoint label once
        if (!commandBufferCheckpoints.contains(commandBufferCheckpoint)) {
            commandBufferCheckpoints.add(commandBufferCheckpoint);
        }
        long checkpointMarker = commandBufferCheckpoints.indexOf(commandBufferCheckpoint);
        //The marker is the list index + 1, so a marker value of 0 (NULL) is never used
        vkCmdSetCheckpointNV(buffer.getCommandBuffer(), checkpointMarker + 1);
    } else if (buffer.getCommandPool().getDevice().getExtensions().contains(VK_AMD_BUFFER_MARKER_EXTENSION_NAME)) {
        //TODO: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCmdWriteBufferMarkerAMD.html
        //TODO: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCmdWriteBufferMarker2AMD.html
    }
}
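The snippet above keeps its labels in a shared `commandBufferCheckpoints` list and uses the list index plus one as the marker value. As a rough sketch of the same idea without the linear `contains`/`indexOf` lookups, and with a guard against unknown markers, something like the following might work. `CheckpointRegistry`, `markerFor` and `labelFor` are hypothetical names, not part of the project above, and the class is plain Java with no Vulkan calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry that maps checkpoint labels to non-zero marker values and back. */
final class CheckpointRegistry {
    private final Map<String, Long> markersByLabel = new HashMap<>();
    private final List<String> labelsByMarker = new ArrayList<>();

    /** Returns the marker for a label, registering the label on first use. Markers start at 1. */
    synchronized long markerFor(String label) {
        Long existing = markersByLabel.get(label);
        if (existing != null) {
            return existing;
        }
        labelsByMarker.add(label);
        long marker = labelsByMarker.size(); // index + 1, so 0 is never handed out
        markersByLabel.put(label, marker);
        return marker;
    }

    /** Resolves a marker read back after a device loss; never throws for unknown values. */
    synchronized String labelFor(long marker) {
        if (marker < 1 || marker > labelsByMarker.size()) {
            return "<unknown marker " + marker + ">";
        }
        return labelsByMarker.get((int) (marker - 1));
    }
}
```

The value from `markerFor` would be what gets passed to `vkCmdSetCheckpointNV`, and `labelFor` would replace the direct `commandBufferCheckpoints.get(...)` calls when dumping the results.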
Thanks, Trevor
Top GitHub Comments
Hi,
I’ve just corrected the code to call `vkGetQueueCheckpointDataNV`. I'm just testing inserting a marker and then calling it to dump the results (I have not provoked any loss of device yet). What I see is that I get the checkpoint twice, at different stages: `VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT` and `VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT`. It seems that the checkpoint is marked for each stage it gets executed in. To be honest, the specification is not clear on whether this is the normal case or not. I will try to provoke a device loss and check what I get.
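As far as I can tell from the extension, each `VkCheckpointDataNV` entry carries a pipeline `stage` together with the last marker executed in that stage, so one entry per stage would be consistent with the output described above. A small sketch of grouping the results that way follows. It is plain Java with no Vulkan calls: the two arrays stand in for the `stage` and `pCheckpointMarker` fields read from the returned structures, `CheckpointStages` is a hypothetical name, and only the two stage bits mentioned above are named.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that groups checkpoint results by pipeline stage. */
final class CheckpointStages {
    // Values of the two Vulkan stage bits mentioned in the comment above.
    static final int TOP_OF_PIPE_BIT = 0x00000001;
    static final int BOTTOM_OF_PIPE_BIT = 0x00002000;

    private CheckpointStages() {}

    static String stageName(int stage) {
        switch (stage) {
            case TOP_OF_PIPE_BIT: return "TOP_OF_PIPE";
            case BOTTOM_OF_PIPE_BIT: return "BOTTOM_OF_PIPE";
            default: return "0x" + Integer.toHexString(stage);
        }
    }

    /** Maps each reported stage to the marker reported for it, in the order returned. */
    static Map<String, Long> lastMarkerPerStage(int[] stages, long[] markers) {
        Map<String, Long> result = new LinkedHashMap<>();
        int n = Math.min(stages.length, markers.length);
        for (int i = 0; i < n; i++) {
            result.put(stageName(stages[i]), markers[i]);
        }
        return result;
    }
}
```

With this view, a marker that shows up at the top of the pipe but not yet at the bottom would point at work that had started but not finished when the query was made.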
I’ve just uploaded a draft version. Please have a look and come back with comments (if any).