Shared memory issues with parallelization
Hi @kdexd,
I am running into all kinds of shared-memory errors after commit 9c1ee36b85c2c63d554471cac2825cf0b9cf2efd, similar to the ones reported in:
https://github.com/pytorch/pytorch/issues/8976
https://github.com/pytorch/pytorch/issues/973

I guess this parallelization is not stable; sometimes it runs and sometimes it breaks, even after applying the usual workarounds, such as:
```python
import resource
import torch.multiprocessing

# Share tensors via the filesystem instead of file descriptors
# (see https://github.com/pytorch/pytorch/issues/973).
torch.multiprocessing.set_sharing_strategy('file_system')

# Raise the soft limit on open file descriptors, keeping the hard limit as-is.
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2048 * 4, rlimit[1]))
```
Is there a leak somewhere? Might be best to have a look.
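
For concreteness, here is a minimal, self-contained sketch of where those two settings have to run: in the main process, before the DataLoader spawns its workers. The dataset, batch size, and worker count below are placeholders, not code from this repo.

```python
import torch
import torch.multiprocessing
from torch.utils.data import DataLoader, TensorDataset

# Must be set in the main process, before any DataLoader workers exist.
torch.multiprocessing.set_sharing_strategy('file_system')

# Placeholder dataset standing in for the real tokenized dataset.
dataset = TensorDataset(torch.randn(1000, 16))

# num_workers > 0 is what triggers tensor sharing between processes.
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for (batch,) in loader:
    pass  # training step would go here
```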
Yeah, I did try with 1 worker and had the same errors. (Can't use 0, because this requires at least one worker 😄)
I have removed the multiprocess tokenization from my code and it works fine.
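
I don't know exactly what the multiprocess tokenization looked like, but the single-process fallback is just a plain loop. A sketch with hypothetical names (`tokenize` and `captions` are stand-ins, not identifiers from this repo):

```python
# Multiprocess version (prone to the shared-memory errors above):
#     with multiprocessing.Pool(4) as pool:
#         tokens = pool.map(tokenize, captions)

def tokenize(caption: str) -> list[str]:
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return caption.lower().split()

captions = ["A dog runs.", "Two cats sleep."]

# Single-process fallback: same result, no worker processes involved.
tokens = [tokenize(c) for c in captions]
```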
Just to let you know, it doesn't happen in the starting iterations or epochs; I'd guess it was after 3-5 epochs.
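
Since the failure only shows up after a few epochs, one way to check for a descriptor leak is to log the process's open-fd count once per epoch and see whether it grows without bound. A minimal, Linux-only sketch (the training loop is a placeholder):

```python
import os

def open_fd_count() -> int:
    """Number of file descriptors currently open in this process (Linux only)."""
    return len(os.listdir('/proc/self/fd'))

for epoch in range(10):  # placeholder training loop
    # ... run one epoch of training here ...
    print(f"epoch {epoch}: {open_fd_count()} open fds")
```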
I think I’m hitting this.
In my setup I’m doing independent runs in parallel threads (not processes, since I’m using LevelDB and it does not support multiprocessing). Sometimes it breaks with the error:
```
RuntimeError: received 0 items of ancdata
```
Even though I’m using the workaround suggested here: https://github.com/pytorch/pytorch/issues/973#issuecomment-346405667
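
It may be worth reading both settings back at runtime to confirm the workaround actually took effect in the process that fails; nothing in this sketch is specific to the repo:

```python
import resource
import torch.multiprocessing

torch.multiprocessing.set_sharing_strategy('file_system')

# The soft limit cannot exceed the hard limit, so clamp it.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(8192, hard), hard))

# Read both settings back to confirm they are active.
print("sharing strategy:", torch.multiprocessing.get_sharing_strategy())
print("RLIMIT_NOFILE soft/hard:", resource.getrlimit(resource.RLIMIT_NOFILE))
```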