
Changing batch size between requests with shared memory fails

See original GitHub issue

When running two clients with different batch sizes one after the other, I get server errors about the expected byte sizes when running inference with shared memory.

Everything works as long as the batch sizes are consistent between clients, or when the client is the very first one. But whenever a previous client has already set a batch size on the server and the currently requested batch size differs from it, I run into something similar to this:

Client error:

Server error message: expected buffer size to be 2945760bytes but gets 5891520 bytes in output tensor

Server error:

[trtserver.cc:1212] Infer failed: expected buffer size to be 2945760bytes but gets 5891520 bytes in output tensor

This happens when trying batch size 16 after another client successfully ran batch size 8, finished, and unregistered its shared memory. (Note that 5891520 is exactly twice 2945760, matching the jump from batch size 8 to 16.) So registering the shared memory with the new size seems to work, but running inference fails. All clients use the same model, of course.
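
For illustration, here is a minimal sketch of that sequence written against the current tritonclient Python API (the original report used the older tensorrtserver client, so the actual calls differed). The model name "simple", the tensor names INPUT0/OUTPUT0, the per-sample shape, and the ports are assumptions for the sketch, not details from the issue:

# Hypothetical repro sketch. Assumed model: "simple", one FP32 input "INPUT0"
# and one FP32 output "OUTPUT0", both shaped [batch, 1000]; adjust to your model.
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm

def run_client(batch_size, client_module=grpcclient, url="localhost:8001"):
    client = client_module.InferenceServerClient(url=url)

    input_data = np.zeros((batch_size, 1000), dtype=np.float32)
    byte_size = input_data.nbytes  # grows with the batch size

    # Create system shared memory regions sized for this batch and register them.
    in_handle = shm.create_shared_memory_region("input_data", "/input_simple", byte_size)
    out_handle = shm.create_shared_memory_region("output_data", "/output_simple", byte_size)
    shm.set_shared_memory_region(in_handle, [input_data])
    client.register_system_shared_memory("input_data", "/input_simple", byte_size)
    client.register_system_shared_memory("output_data", "/output_simple", byte_size)

    # Point the request's input and output at the registered regions.
    infer_input = client_module.InferInput("INPUT0", list(input_data.shape), "FP32")
    infer_input.set_shared_memory("input_data", byte_size)
    infer_output = client_module.InferRequestedOutput("OUTPUT0")
    infer_output.set_shared_memory("output_data", byte_size)

    client.infer(model_name="simple", inputs=[infer_input], outputs=[infer_output])

    # Clean up as the clients in the report did: unregister and destroy the regions.
    # (The output is left unread; this only mirrors the register/infer/unregister sequence.)
    client.unregister_system_shared_memory("input_data")
    client.unregister_system_shared_memory("output_data")
    shm.destroy_shared_memory_region(in_handle)
    shm.destroy_shared_memory_region(out_handle)

# The first client runs batch size 8 and cleans up after itself; the second then
# requests batch size 16, the point at which the reported error appeared.
run_client(8)
run_client(16)

The point of the sketch is that each client registers regions whose byte size is derived from its own batch size, so the second registration is exactly twice as large as the first.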

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
CoderHam commented, Sep 12, 2019

Fixed the GRPC failure when using different batch sizes with the aforementioned PR. Thank you for bringing it to our attention.

0 reactions
CoderHam commented, Sep 10, 2019

Thanks @philipp-schmidt. Just revisited this (had earlier tested with HTTP and it worked fine). You were right, there was an issue in the GRPC server. Fixed in grpc_server.cc.

Summary of change here.
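
To illustrate the HTTP-vs-GRPC distinction mentioned above, the hypothetical run_client helper from the sketch earlier can be pointed at either protocol endpoint; per this thread, only the GRPC path failed before the fix (default ports 8000/8001 assumed):

# Reuses the hypothetical run_client helper defined in the sketch above.
import tritonclient.http as httpclient
import tritonclient.grpc as grpcclient

for module, url in ((httpclient, "localhost:8000"), (grpcclient, "localhost:8001")):
    run_client(8, client_module=module, url=url)   # first client's batch size
    run_client(16, client_module=module, url=url)  # changed batch size; GRPC reportedly failed here pre-fix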
