[DISCUSS][RFC] Static Destruction Order Problem

A crash occurs when the process exits if at least one Vulkan device has been constructed. This appears to be caused by the way static destruction order interacts with library unloading order on Windows (a similar issue appeared previously with NNVM).

The core of the previous issue, and seemingly of this one, is that when libraries use static destructors, it is easy to end up with one library calling into another after that other library has been unloaded. I suspect that in this case the Vulkan library is unloaded before TVM’s static destructors are invoked, leading to a crash when trying to destroy Vulkan devices.

The crash occurs here, on the call to vkDestroyDevice: https://github.com/dmlc/tvm/blob/master/src/runtime/vulkan/vulkan_device_api.cc#L16

We can check that this is due to the destruction/library unload order by forcing this destructor to be called before any libraries are unloaded, by calling it at the end of main(). This requires a small modification to the destructor so that it won’t fail when called twice:

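VulkanWorkspace::~VulkanWorkspace() {
  // Destroy every logical device that was created.
  for (VulkanContext& ctx : context_) {
    vkDestroyDevice(ctx.device, nullptr);
  }
  // Drop the destroyed handles so a second call does not destroy them again
  // (assumes context_ is a standard container such as std::vector).
  context_.clear();
  // Guard the instance so a second call is a no-op.
  if (instance_ != nullptr) {
    vkDestroyInstance(instance_, nullptr);
    instance_ = nullptr;
  }
}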

and then making the following call at the end of main()

  TVMContext vulkan_ctx{ (DLDeviceType)kDLVulkan, 0 };
  tvm::runtime::DeviceAPI::Get(vulkan_ctx)->~DeviceAPI();

stops the issue from occurring entirely.

However, this is quite a hacky solution, and doing this from Python is even more cumbersome, especially if the process doesn’t exit cleanly.

Is there a better way of handling this cleanup that can prevent these issues?

More generally, there may be a wider issue of relying on the lifetimes of static variables, which keeps manifesting in this kind of problem. Is there a better way of handling library-wide lifetimes that could prevent this as the project grows? The most direct solution I can think of is requiring library clients to call a cleanup function when they are done with the library, e.g. TVMDestroy (for C++ this could have an RAII wrapper that is constructed in main()). This is more onerous, but would guarantee that the cleanup code always runs before any libraries are unloaded during process shutdown. It could also be used to allow other libraries like NNVM and TOPI to register their own cleanup functions with TVM, so that TVM can ensure their cleanup happens first.
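To make the proposal concrete, here is a minimal sketch of what such an explicit cleanup entry point could look like. Everything in it is hypothetical: CleanupRegistry, TVMScope, and this TVMDestroy are illustrative names only and are not part of TVM. It assumes that dependent libraries register their hooks after TVM registers its own, so running hooks in reverse registration order cleans up the dependents first.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical registry of cleanup hooks; not an existing TVM API.
class CleanupRegistry {
 public:
  static CleanupRegistry& Global() {
    // Leaked on purpose: the registry itself must never be torn down by
    // static destruction, or it would reintroduce the ordering problem.
    static CleanupRegistry* inst = new CleanupRegistry();
    return *inst;
  }

  // Called by TVM and by dependent libraries to register a cleanup hook.
  void Register(std::function<void()> fn) {
    assert(fn && "cleanup hook must be callable");
    std::lock_guard<std::mutex> lock(mu_);
    hooks_.push_back(std::move(fn));
  }

  // Runs hooks in reverse registration order. Idempotent: the hook list is
  // emptied before running, so a second call finds nothing to do.
  void Finalize() {
    std::vector<std::function<void()>> hooks;
    {
      std::lock_guard<std::mutex> lock(mu_);
      hooks.swap(hooks_);
    }
    for (auto it = hooks.rbegin(); it != hooks.rend(); ++it) {
      (*it)();
    }
  }

 private:
  std::mutex mu_;
  std::vector<std::function<void()>> hooks_;
};

// Hypothetical explicit cleanup entry point, named after the suggestion above.
inline void TVMDestroy() { CleanupRegistry::Global().Finalize(); }

// RAII wrapper: construct one at the top of main() so cleanup runs when
// main() returns, before the process starts unloading libraries.
struct TVMScope {
  TVMScope() = default;
  TVMScope(const TVMScope&) = delete;
  TVMScope& operator=(const TVMScope&) = delete;
  ~TVMScope() { TVMDestroy(); }
};
```

A client would declare `TVMScope scope;` as the first statement in main(). The guard only helps when main() returns normally; a process that is killed or calls exit paths that skip stack unwinding would still need another mechanism.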

I’d be interested to hear any thoughts on how this specific issue can be resolved easily and, more generally, on how this type of destructor/library-unload issue can be avoided in the future.

Issue Analytics

Created 5 years ago
Comments:7 (7 by maintainers)

Top GitHub Comments

alex-weaver commented, Jul 1, 2018

Ah - I wasn’t specific before, but the error raised when calling vkDestroyDevice is an access violation, not a catchable exception. This was because the Vulkan API library had been unloaded by the OS before TVM’s static destructors were called.

This means that any cross-library function call in a static destructor can potentially cause an uncatchable error, and I’m not sure there is a safe way to determine whether a library you depend on has been unloaded by the time the static destructor runs.

As far as I can see, the only way to prevent this is to require an eager destructor for any static object whose destructor may make cross-library calls (which could be as subtle as releasing a reference to a registered shared_ptr).

tqchen commented, Jul 1, 2018

Unfortunately, the static destruction order problem does occur in certain cases. There are several ways to alleviate this problem:

For resources that we can control, always obtain a shared_ptr to any global singleton you depend on (so that its destruction happens after yours).
I do agree that having an eager destructor could help in certain cases, as long as the destructor itself is idempotent (if it is called twice, the second call has no effect). Let us say this function is called TVMFinalize. In cases where this function is not called, the destructor will still be called automatically.
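The two suggestions above can be sketched together as follows. This is only an illustration under stated assumptions: Backend, Workspace, and EventLog are hypothetical stand-ins, not TVM classes, and the Finalize method plays the role of the TVMFinalize idea. The Workspace holds a shared_ptr to the object it depends on, so that object cannot be destroyed first, and its Finalize method is idempotent and is also invoked from the destructor.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Records lifecycle events so the order can be inspected (illustrative only).
inline std::vector<std::string>& EventLog() {
  static auto* log = new std::vector<std::string>();  // leaked on purpose
  return *log;
}

// Hypothetical stand-in for a global singleton that other objects depend on.
class Backend {
 public:
  ~Backend() { EventLog().push_back("backend destroyed"); }
};

// Hypothetical dependent object with an eager, idempotent finalizer.
class Workspace {
 public:
  explicit Workspace(std::shared_ptr<Backend> backend)
      : backend_(std::move(backend)) {
    assert(backend_ != nullptr);
  }

  // If Finalize() was never called explicitly, the destructor still runs it.
  ~Workspace() { Finalize(); }

  // Eager cleanup. Idempotent: the second and later calls do nothing.
  void Finalize() {
    if (finalized_) return;
    finalized_ = true;
    EventLog().push_back("workspace finalized");
    // Release the dependency only after our own cleanup is complete.
    backend_.reset();
  }

 private:
  std::shared_ptr<Backend> backend_;  // keeps Backend alive while we need it
  bool finalized_ = false;
};
```

Note that shared ownership only orders the destruction of objects the program itself controls. It cannot keep a dynamically loaded library such as the Vulkan loader mapped in memory, which is why the eager finalizer is still needed for the cross-library case.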