Segmentation fault related to PinnedMemoryManager


Description A clear and concise description of what the bug is.

I encountered a segmentation fault repeatedly.

Triton Information What version of Triton are you using? 2.4.0

Are you using the Triton container or did you build it yourself? NGC container:20.11-py3

To Reproduce Steps to reproduce the behavior.

I have no idea how to reproduce this error since it rarely happens, but I got two core dump files related to it.

  • core dump stacktrace 1

    root@api-inference-deployment-69cf64fb74-phzcd:/opt/tritonserver# gdb bin/tritonserver /var/crash/core.tritonserver.1.api-inference-deployment-69cf64fb74-phzcd
    GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
    Reading symbols from bin/tritonserver...(no debugging symbols found)...done.
    [New LWP 15]
    ....
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Core was generated by `tritonserver --id=api-inference-20201201115135 --model-repository=/model_repo -'.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  0x00007fa09da86016 in nvidia::inferenceserver::PinnedMemoryManager::AllocInternal(void**, unsigned long, TRITONSERVER_memorytype_enum*, bool) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    [Current thread is 1 (Thread 0x7f9fb7fff700 (LWP 15))]
    (gdb) bt
    #0  0x00007fa09da86016 in nvidia::inferenceserver::PinnedMemoryManager::AllocInternal(void**, unsigned long, TRITONSERVER_memorytype_enum*, bool) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #1  0x00007fa09da874a3 in nvidia::inferenceserver::PinnedMemoryManager::Alloc(void**, unsigned long, TRITONSERVER_memorytype_enum*, bool) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #2  0x00007fa09da4c54f in nvidia::inferenceserver::AllocatedMemory::AllocatedMemory(unsigned long, TRITONSERVER_memorytype_enum, long) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #3  0x00007fa09da0cfef in nvidia::inferenceserver::(anonymous namespace)::EnsembleContext::ResponseAlloc(TRITONSERVER_ResponseAllocator*, char const*, unsigned long, TRITONSERVER_memorytype_enum, long, void*, void**, void**, TRITONSERVER_memorytype_enum*, long*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #4  0x00007fa09da44292 in nvidia::inferenceserver::InferenceResponse::Output::AllocateDataBuffer(void**, unsigned long, TRITONSERVER_memorytype_enum*, long*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #5  0x00007fa09dbefcb5 in TRITONBACKEND_OutputBuffer () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #6  0x00007f9fe056029d in triton::backend::dali::detail::AllocateOutputs(TRITONBACKEND_Request*, TRITONBACKEND_Response*, std::vector<triton::backend::dali::shape_and_type_t, std::allocator<triton::backend::dali::shape_and_type_t> > const&) () from /opt/tritonserver/backends/dali/libtriton_dali.so
    #7  0x00007f9fe05606ab in triton::backend::dali::detail::ProcessRequest(TRITONBACKEND_Response*, TRITONBACKEND_Request*, triton::backend::dali::DaliExecutor&) () from /opt/tritonserver/backends/dali/libtriton_dali.so
    #8  0x00007f9fe05609e6 in TRITONBACKEND_ModelInstanceExecute () from /opt/tritonserver/backends/dali/libtriton_dali.so
    #9  0x00007fa09dbf2c2a in std::_Function_handler<void (unsigned int, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&), nvidia::inferenceserver::TritonModel::Create(nvidia::inferenceserver::InferenceServer*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, inference::ModelConfig const&, std::unique_ptr<nvidia::inferenceserver::TritonModel, std::default_delete<nvidia::inferenceserver::TritonModel> >*)::{lambda(unsigned int, 
std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&)#2}>::_M_invoke(std::_Any_data const&, unsigned int&&, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #10 0x00007fa09d9fd3c1 in nvidia::inferenceserver::DynamicBatchScheduler::SchedulerThread(unsigned int, int, std::shared_ptr<std::atomic<bool> > const&, std::promise<bool>*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #11 0x00007fa09c8e56df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
    #12 0x00007fa09d52d6db in start_thread (arg=0x7f9fb7fff700) at pthread_create.c:463
    #13 0x00007fa09bfa2a3f in epoll_wait (epfd=-1207961856, events=0x0, maxevents=-1208037376, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
    #14 0x0000000000000000 in ?? ()
    (gdb)
    

  • core dump stacktrace 2

    root@api-inference-deployment-69cf64fb74-rk2t8:/opt/tritonserver# gdb bin/tritonserver /var/crash/core.tritonserver.1.api-inference-deployment-69cf64fb74-rk2t8
    GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
    Reading symbols from bin/tritonserver...(no debugging symbols found)...done.
    [New LWP 38]
    ...
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Core was generated by `tritonserver --id=api-inference-20201201115135 --model-repository=/model_repo -'.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  0x00007fa37b302cc4 in boost::intrusive::rbtree_algorithms<boost::intrusive::rbtree_node_traits<boost::interprocess::offset_ptr<void, long, unsigned long, 0ul>, true> >::rebalance_after_insertion(boost::interprocess::offset_ptr<boost::intrusive::compact_rbtree_node<boost::interprocess::offset_ptr<void, long, unsigned long, 0ul> >, long, unsigned long, 0ul> const&, boost::interprocess::offset_ptr<boost::intrusive::compact_rbtree_node<boost::interprocess::offset_ptr<void, long, unsigned long, 0ul> >, long, unsigned long, 0ul>) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    [Current thread is 1 (Thread 0x7fa2b17fe700 (LWP 38))]
    (gdb) bt
    #0  0x00007fa37b302cc4 in boost::intrusive::rbtree_algorithms<boost::intrusive::rbtree_node_traits<boost::interprocess::offset_ptr<void, long, unsigned long, 0ul>, true> >::rebalance_after_insertion(boost::interprocess::offset_ptr<boost::intrusive::compact_rbtree_node<boost::interprocess::offset_ptr<void, long, unsigned long, 0ul> >, long, unsigned long, 0ul> const&, boost::interprocess::offset_ptr<boost::intrusive::compact_rbtree_node<boost::interprocess::offset_ptr<void, long, unsigned long, 0ul> >, long, unsigned long, 0ul>) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #1  0x00007fa37b3046b3 in boost::intrusive::bstree_impl<boost::intrusive::bhtraits<boost::interprocess::rbtree_best_fit<boost::interprocess::null_mutex_family, boost::interprocess::offset_ptr<void, long, unsigned long, 0ul>, 0ul>::block_ctrl, boost::intrusive::rbtree_node_traits<boost::interprocess::offset_ptr<void, long, unsigned long, 0ul>, true>, (boost::intrusive::link_mode_type)0, boost::intrusive::dft_tag, 3u>, void, void, unsigned long, true, (boost::intrusive::algo_types)5, void>::insert_equal(boost::intrusive::tree_iterator<boost::intrusive::bhtraits<boost::interprocess::rbtree_best_fit<boost::interprocess::null_mutex_family, boost::interprocess::offset_ptr<void, long, unsigned long, 0ul>, 0ul>::block_ctrl, boost::intrusive::rbtree_node_traits<boost::interprocess::offset_ptr<void, long, unsigned long, 0ul>, true>, (boost::intrusive::link_mode_type)0, boost::intrusive::dft_tag, 3u>, true>, boost::interprocess::rbtree_best_fit<boost::interprocess::null_mutex_family, boost::interprocess::offset_ptr<void, long, unsigned long, 0ul>, 0ul>::block_ctrl&) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #2  0x00007fa37b305067 in boost::interprocess::rbtree_best_fit<boost::interprocess::null_mutex_family, boost::interprocess::offset_ptr<void, long, unsigned long, 0ul>, 0ul>::priv_deallocate(void*) ()
       from /opt/tritonserver/bin/../lib/libtritonserver.so
    #3  0x00007fa37b2fd9e4 in nvidia::inferenceserver::PinnedMemoryManager::FreeInternal(void*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #4  0x00007fa37b2fdc59 in nvidia::inferenceserver::PinnedMemoryManager::Free(void*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #5  0x00007fa37b2c4150 in nvidia::inferenceserver::AllocatedMemory::~AllocatedMemory() () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #6  0x00007fa37b263d62 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, nvidia::inferenceserver::InferenceRequest::Input>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, nvidia::inferenceserver::InferenceRequest::Input> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #7  0x00007fa37b33652c in nvidia::inferenceserver::InferenceRequest::~InferenceRequest() () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #8  0x00007fa37b32f67e in TRITONSERVER_InferenceRequestDelete () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #9  0x00007fa37b283843 in nvidia::inferenceserver::(anonymous namespace)::EnsembleContext::RequestComplete(TRITONSERVER_InferenceRequest*, unsigned int, void*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #10 0x00007fa37b2b0838 in nvidia::inferenceserver::InferenceRequest::Release(std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >&&, unsigned int) ()
       from /opt/tritonserver/bin/../lib/libtritonserver.so
    #11 0x00007fa37b445362 in nvidia::inferenceserver::LibTorchBackend::Context::Run(nvidia::inferenceserver::InferenceBackend*, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #12 0x00007fa37b25ba30 in nvidia::inferenceserver::InferenceBackend::Run(unsigned int, std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #13 0x00007fa37b2753c1 in nvidia::inferenceserver::DynamicBatchScheduler::SchedulerThread(unsigned int, int, std::shared_ptr<std::atomic<bool> > const&, std::promise<bool>*) () from /opt/tritonserver/bin/../lib/libtritonserver.so
    #14 0x00007fa37a15d6df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
    #15 0x00007fa37ada56db in start_thread (arg=0x7fa2b17fe700) at pthread_create.c:463
    #16 0x00007fa37981aa3f in epoll_wait (epfd=-1317017856, events=0x0, maxevents=-1317093376, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
    #17 0x0000000000000000 in ?? ()
    (gdb)
     

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

I am currently using an ensemble with the DALI and libtorch backends, but that does not seem relevant.

# Ensemble
ensemble([DALI, libtorch])(X) -> Y
X = (dtype: UINT8, dims: [-1])  # jpeg bytes
Y = [(dtype: FP32, dims: [-1, 4]),  # bbox
     (dtype: FP32, dims: [-1, 2]),  # class
     (dtype: FP32, dims: [-1, 10]),  # landmark
    ]

# pre-proc
DALI(X) -> Y
X = (dtype: UINT8, dims: [-1])  # jpeg bytes
Y = (dtype: FP32, dims: [3, -1, -1])

# a typical landmark detection model
libtorch(X) -> Y
X = (dtype: FP32, dims: [3, -1, -1])
Y = [(dtype: FP32, dims: [-1, 4]),  # bbox
     (dtype: FP32, dims: [-1, 2]),  # class
     (dtype: FP32, dims: [-1, 10]),  # landmark
    ]

Expected behavior A clear and concise description of what you expected to happen.

No segfault.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
philipp-schmidt commented, Jan 26, 2021

If I understood correctly, per @deadeyegoodwin’s comment (https://github.com/triton-inference-server/server/issues/2135#issuecomment-767240946), the fix for the race condition has not yet made it into the NGC container. You need to build from the master branch and check there.

1 reaction
GuanLuo commented, Jan 6, 2021

The log suggests that you have a TensorRT model. Is that also part of the ensemble? I ask because we recently fixed an issue with the use of pinned memory buffers in the TensorRT backend.
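
For readers trying to picture the kind of race the comments refer to: the second stack trace shows the pool allocator instantiated with `boost::interprocess::null_mutex_family`, which means the allocator itself does no locking and relies on its caller to serialize `Alloc` and `Free`. The sketch below is only an illustration of that pattern in Python. The class names `UnsyncedPool` and `LockedPool` and the helper `run` are hypothetical; this is not Triton's code, and the thread does not say what the actual fix was.

```python
import threading


class UnsyncedPool:
    """Toy block pool with no internal locking (hypothetical, not Triton code).

    Stands in for an allocator that expects its caller to serialize access.
    """

    def __init__(self, n_blocks):
        self.free_blocks = list(range(n_blocks))

    def alloc(self):
        # Check-then-act: unsafe if another thread runs between the two steps.
        if not self.free_blocks:
            return None
        return self.free_blocks.pop()

    def free(self, block):
        self.free_blocks.append(block)


class LockedPool:
    """Wraps the unsynchronized pool so every alloc/free holds one mutex."""

    def __init__(self, n_blocks):
        self._pool = UnsyncedPool(n_blocks)
        self._lock = threading.Lock()

    def alloc(self):
        with self._lock:
            return self._pool.alloc()

    def free(self, block):
        with self._lock:
            self._pool.free(block)

    def snapshot(self):
        # Copy the free list under the lock so the caller sees a stable view.
        with self._lock:
            return list(self._pool.free_blocks)


def run(pool, n_threads=4, rounds=1000):
    """Hammer the pool from several threads, then return the sorted free list."""

    def worker():
        for _ in range(rounds):
            block = pool.alloc()
            if block is not None:
                pool.free(block)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(pool.snapshot())


if __name__ == "__main__":
    # Every block should be back in the pool exactly once.
    print(run(LockedPool(8)))
```

Python will not reproduce a segfault here; the point is only the structure. If any code path reaches the unsynchronized pool without taking the lock, two scheduler threads can modify the free-block bookkeeping at the same time, which would be consistent with one crash inside `AllocInternal` and another inside the red-black tree rebalancing during `Free`.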
