Shared memory issues with parallelization


Hi @kdexd

I am running into all kinds of shared memory errors after this commit 9c1ee36b85c2c63d554471cac2825cf0b9cf2efd

https://github.com/pytorch/pytorch/issues/8976 https://github.com/pytorch/pytorch/issues/973

I guess this parallelization is not stable: sometimes it runs and sometimes it breaks, even after trying possible solutions such as:

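import resource

import torch.multiprocessing

# Share tensors through named shared-memory files rather than file descriptors.
torch.multiprocessing.set_sharing_strategy('file_system')

# Raise the soft limit on open file descriptors (workaround from
# https://github.com/pytorch/pytorch/issues/973).
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2048*4, rlimit[1]))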
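A slightly more defensive variant of the limit-raising step above may be useful, since a fixed value of 8192 can exceed the hard limit on some machines and make `setrlimit` raise `ValueError`. This is only a sketch: the helper name `raise_nofile_soft_limit` and the default target are assumptions for illustration, not part of the original report or of PyTorch.

```python
import resource


def raise_nofile_soft_limit(target=8192):
    """Raise the soft RLIMIT_NOFILE toward `target` without exceeding the hard limit.

    Hypothetical helper for illustration. Returns the (soft, hard) pair
    in effect after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Cap the request at the hard limit when that limit is finite.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Leave the limit alone if it is already at least as high as requested.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard

    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `raise_nofile_soft_limit()` once at program start, before any worker processes are created, would let the workers inherit the raised limit.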

Is there a leak somewhere? Might be best to have a look.
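One way to check for a descriptor leak is to log the number of open file descriptors at regular intervals (for example, once per epoch) and see whether it keeps growing. The following is a minimal sketch using only the standard library; the function name `count_open_fds` is an assumption for illustration, and it relies on `/proc/self/fd` (Linux) or `/dev/fd` (macOS and some BSDs) being available, with a slower probing fallback otherwise.

```python
import os
import resource


def count_open_fds():
    """Return the number of file descriptors currently open in this process.

    Hypothetical helper for illustration. Listing the directory opens one
    temporary descriptor itself, so treat the value as a relative measure.
    """
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            return len(os.listdir(fd_dir))
        except OSError:
            continue

    # Fallback: probe each descriptor number up to a bounded limit.
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    limit = 4096 if soft == resource.RLIM_INFINITY else min(soft, 4096)
    count = 0
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue
        count += 1
    return count
```

If the value printed by `count_open_fds()` at the end of each epoch rises steadily instead of levelling off, that would point to descriptors not being released, which is consistent with failures that only appear after several epochs.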

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

2 reactions
shubhamagarwal92 commented, Aug 6, 2019

Yeah, I did try with 1 worker and had the same errors. (Can’t use 0 because this requires at least one worker 😄 )

Have removed multiprocess tokenization in my code and it works fine.

Just to let you know, it doesn’t happen in the first iterations or epochs. I think it started after 3-5 epochs.

1 reaction
lucmos commented, Feb 11, 2020

I think I’m hitting this.

In my setup I’m doing independent runs in parallel threads (not processes, since I’m using LevelDB and it does not support multiprocessing). Sometimes it breaks with the error:

RuntimeError: received 0 items of ancdata

This happens even though I’m using the workaround suggested here: https://github.com/pytorch/pytorch/issues/973#issuecomment-346405667
