Add section on loss of device troubleshooting


It would be great to have a section on how to troubleshoot a device-lost error (VK_ERROR_DEVICE_LOST). This issue stumps many people when working with Vulkan for the first time. Some things that would be helpful to include:

  • How to use vkCmdSetCheckpointNV/vkGetQueueCheckpointDataNV/vkGetQueueCheckpointData2NV for NVIDIA
  • How to use vkCmdWriteBufferMarkerAMD/vkCmdWriteBufferMarker2AMD for AMD (I’m not even sure how to use these; I have only used the NVIDIA ones)
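For the AMD bullet, one possible approach is sketched here; it is not from the original post. vkCmdWriteBufferMarkerAMD(commandBuffer, pipelineStage, dstBuffer, dstOffset, marker) writes a 32-bit value into a buffer once earlier commands have passed the given stage. The sketch assumes a hypothetical two-slot layout in a zero-filled, host-visible buffer: each marked command writes its id to offset 0 at TOP_OF_PIPE before it runs and to offset 4 at BOTTOM_OF_PIPE after it. Only the host-side decoding is shown, and whether the buffer contents are still readable after a device loss depends on the driver. The class name and layout are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: decodes a two-slot marker buffer written with vkCmdWriteBufferMarkerAMD. */
final class BufferMarkerReport {
    // Assumed layout (not from the original post):
    //   offset 0: id written at TOP_OF_PIPE before each marked command
    //   offset 4: id written at BOTTOM_OF_PIPE after each marked command
    // Ids start at 1; the buffer is assumed to be zero-filled before submission.
    static final int STARTED_OFFSET = 0;
    static final int FINISHED_OFFSET = 4; // dstOffset must be a multiple of 4

    /** Summarises how far the GPU got, given the host-mapped marker memory. */
    static String describe(ByteBuffer mapped) {
        // Work on a duplicate so the caller's position and byte order are untouched.
        ByteBuffer view = mapped.duplicate().order(ByteOrder.nativeOrder());
        int started = view.getInt(STARTED_OFFSET);
        int finished = view.getInt(FINISHED_OFFSET);
        if (started == 0) {
            return "no marked command started";
        }
        if (started == finished) {
            return "all marked commands up to " + finished + " finished";
        }
        return "command " + started + " started but did not finish (last finished: " + finished + ")";
    }
}
```

After catching the device-lost error, the mapped memory of the marker buffer would be passed to BufferMarkerReport.describe and the result logged alongside the label registered for that id.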

Useful link: https://vulkan.lunarg.com/doc/view/1.2.162.1/mac/chunked_spec/chap45.html#_device_loss_debugging

Here is a sample snippet of an implementation I hacked together for the project I am working on. It does not work perfectly, especially on older cards (I suspect that older drivers have issues with setting checkpoints). Perhaps it will be of use in putting something together for the book.

    public static String[] getOptionalDebuggingDeviceExtensions() {
        return debugCommandBuffers ?
                new String[] {
                        //NVIDIA: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_NV_device_diagnostic_checkpoints.html
                        VK_NV_DEVICE_DIAGNOSTIC_CHECKPOINTS_EXTENSION_NAME,

                        //AMD: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_AMD_buffer_marker.html
                        //TODO: VK_AMD_BUFFER_MARKER_EXTENSION_NAME,

                        //Optional: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VK_KHR_synchronization2.html
                        VK_KHR_SYNCHRONIZATION_2_EXTENSION_NAME, //Disable for nvidia NSIGHT
                }
                : new String[0];
    }

    public static void debugLostDeviceNV2(Vk12Queue queue) {
        if (!debugCommandBuffers) {
            return;
        }
        if (queue.getDevice().getExtensions().contains(VK_NV_DEVICE_DIAGNOSTIC_CHECKPOINTS_EXTENSION_NAME) && queue.getDevice().getExtensions().contains(VK_KHR_SYNCHRONIZATION_2_EXTENSION_NAME)) {
            try (MemoryStack stack = MemoryStack.stackPush()) {
                IntBuffer count = stack.callocInt(1);
                vkGetQueueCheckpointData2NV(queue.getQueue(), count, null);

                if (count.get(0) > 0) {
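                    // Allocate on the stack frame so the structs are freed with it, and set sType on every element as the spec requires.
                    VkCheckpointData2NV.Buffer checkpointData = VkCheckpointData2NV.calloc(count.get(0), stack);
                    checkpointData.forEach(data -> data.sType(VK_STRUCTURE_TYPE_CHECKPOINT_DATA_2_NV));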
                    vkGetQueueCheckpointData2NV(queue.getQueue(), count, checkpointData);
                    List<String> checkpoints = new ArrayList<>(count.get(0));
                    for (VkCheckpointData2NV checkpoint : checkpointData.stream().toList()) {
                        checkpoints.add(commandBufferCheckpoints.get((int) checkpoint.pCheckpointMarker() - 1));
                    }
                    LOGGER.severe("Device lost on command buffer checkpoint: \n\t* " + String.join("\n\t* ", checkpoints));
                } else {
                    debugLostDeviceNV(queue);
                }
            }
        } else {
            debugLostDeviceNV(queue);
        }
    }

    public static void debugLostDeviceNV(Vk12Queue queue) {
        if (!debugCommandBuffers) {
            return;
        }
        if (queue.getDevice().getExtensions().contains(VK_NV_DEVICE_DIAGNOSTIC_CHECKPOINTS_EXTENSION_NAME)) {
            try (MemoryStack stack = MemoryStack.stackPush()) {
                IntBuffer count = stack.callocInt(1);
                vkGetQueueCheckpointDataNV(queue.getQueue(), count, null);

                if (count.get(0) > 0) {
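                    // Allocate on the stack frame so the structs are freed with it, and set sType on every element as the spec requires.
                    VkCheckpointDataNV.Buffer checkpointData = VkCheckpointDataNV.calloc(count.get(0), stack);
                    checkpointData.forEach(data -> data.sType(VK_STRUCTURE_TYPE_CHECKPOINT_DATA_NV));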
                    vkGetQueueCheckpointDataNV(queue.getQueue(), count, checkpointData);
                    List<String> checkpoints = new ArrayList<>(count.get(0));
                    for (VkCheckpointDataNV checkpoint : checkpointData.stream().toList()) {
                        checkpoints.add(commandBufferCheckpoints.get((int) checkpoint.pCheckpointMarker() - 1));
                    }
                    LOGGER.severe("Device lost on command buffer checkpoint: \n\t* " + String.join("\n\t* ", checkpoints));
                } else {
                    LOGGER.warning("No command buffer checkpoints recorded for device lost error.");
                }
            }
        } else {
            debugLostDeviceAMD(queue);
        }
    }

    public static void debugLostDeviceAMD(Vk12Queue queue) {
        if (!debugCommandBuffers) {
            return;
        }
        if (queue.getDevice().getExtensions().contains(VK_AMD_BUFFER_MARKER_EXTENSION_NAME)) {
            //TODO: ???
        } else {
            LOGGER.warning("Cannot debug lost device, missing device extensions.");
        }
    }

    public static void debugCommandBuffer(Vk12CommandBuffer buffer, String commandBufferCheckpoint) {
        if (!debugCommandBuffers) {
            return;
        }
        if (buffer.getCommandPool().getDevice().getExtensions().contains(VK_NV_DEVICE_DIAGNOSTIC_CHECKPOINTS_EXTENSION_NAME)) {
            long checkpointMarker;
            if (!commandBufferCheckpoints.contains(commandBufferCheckpoint)) {
                commandBufferCheckpoints.add(commandBufferCheckpoint);
            }
            checkpointMarker = commandBufferCheckpoints.indexOf(commandBufferCheckpoint);
            vkCmdSetCheckpointNV(buffer.getCommandBuffer(), checkpointMarker + 1);
        } else if (buffer.getCommandPool().getDevice().getExtensions().contains(VK_AMD_BUFFER_MARKER_EXTENSION_NAME)) {
            //TODO: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCmdWriteBufferMarkerAMD.html
            //TODO: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCmdWriteBufferMarker2AMD.html
        }
    }
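The snippet above maps labels to marker values through a commandBufferCheckpoints list, using contains/indexOf on every call and an unchecked get when reading markers back. A minimal sketch of an alternative registry follows. It is an illustration, not part of the original code: the class name CheckpointLabels and its methods are hypothetical. It keeps the same convention of 1-based ids (so 0 never appears as a marker) and returns a placeholder instead of throwing when the driver reports a value that was never registered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical replacement for the commandBufferCheckpoints list used above. */
final class CheckpointLabels {
    private final Map<String, Long> idsByLabel = new HashMap<>();
    private final List<String> labelsById = new ArrayList<>();

    /** Returns a stable, non-zero marker id for the label, registering it if needed. */
    synchronized long idFor(String label) {
        Long existing = idsByLabel.get(label);
        if (existing != null) {
            return existing;
        }
        labelsById.add(label);
        long id = labelsById.size(); // ids start at 1 so 0 can mean "no marker"
        idsByLabel.put(label, id);
        return id;
    }

    /** Resolves a marker value read back from the checkpoint data into its label. */
    synchronized String labelFor(long marker) {
        if (marker < 1 || marker > labelsById.size()) {
            return "<unknown marker " + marker + ">";
        }
        return labelsById.get((int) (marker - 1));
    }
}
```

With this in place, debugCommandBuffer would pass idFor(label) to vkCmdSetCheckpointNV, and the debugLostDevice methods would call labelFor(checkpoint.pCheckpointMarker()).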

Thanks, Trevor

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
lwjglgamedev commented, Dec 30, 2021

Hi,

I’ve just corrected the code to call `vkGetQueueCheckpointDataNV`. I'm just testing inserting a marker and then dumping the results (I have not provoked a device loss yet). What I see is that I get the checkpoint twice, at different stages: `VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT` and `VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT`. It seems that the checkpoint is reported for each stage at which it gets executed. To be honest, the specification is not clear on whether this is the normal case.
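If the per-stage entries behave as observed here, one way to read them is as a window: the TOP_OF_PIPE marker is the last checkpoint the GPU started and the BOTTOM_OF_PIPE marker is the last one it completed, so the fault is likely somewhere between the two. The sketch below shows that interpretation only. It assumes marker values were assigned in increasing recording order, and the class and method names are hypothetical. The two stage constants use the bit values from the Vulkan headers.

```java
import java.util.List;

/** Hypothetical helper that turns per-stage checkpoint entries into a "last finished / last started" window. */
final class CheckpointWindow {
    // Bit values from the Vulkan headers.
    static final int VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT = 0x00000001;
    static final int VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT = 0x00002000;

    /** One entry as reported by vkGetQueueCheckpointDataNV: a stage bit and the marker value. */
    static final class Entry {
        final int stage;
        final long marker;

        Entry(int stage, long marker) {
            this.stage = stage;
            this.marker = marker;
        }
    }

    /** Returns {lastFinished, lastStarted}; 0 means no marker was reported for that stage. */
    static long[] window(List<Entry> entries) {
        long started = 0;
        long finished = 0;
        for (Entry entry : entries) {
            if (entry.stage == VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT) {
                started = entry.marker;
            } else if (entry.stage == VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT) {
                finished = entry.marker;
            }
        }
        return new long[] {finished, started};
    }
}
```

When both values are equal, every marked command that started also finished, which matches the test without a device loss described in the comment. A gap between them narrows the search to the commands recorded between those two checkpoints.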

I will try to provoke a device loss and check what I get.

1 reaction
lwjglgamedev commented, Dec 30, 2021

I’ve just uploaded a draft version. Please have a look and come back with comments (if any).
