Shared memory issues with parallelization
Hi @kdexd,
I am running into all kinds of shared-memory errors after commit 9c1ee36b85c2c63d554471cac2825cf0b9cf2efd, similar to the ones reported in:
https://github.com/pytorch/pytorch/issues/8976
https://github.com/pytorch/pytorch/issues/973

I guess this parallelization is not stable; sometimes it runs and sometimes it breaks, even after applying the usual workarounds, such as:
```python
import resource
import torch.multiprocessing

# Share tensors via the filesystem instead of file descriptors
# (see https://github.com/pytorch/pytorch/issues/973).
torch.multiprocessing.set_sharing_strategy('file_system')

# Raise the soft limit on open file descriptors, keeping the hard limit as-is.
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2048 * 4, rlimit[1]))
```
Is there a leak somewhere? Might be best to have a look.
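
For concreteness, here is a minimal, self-contained sketch of where those two settings have to run: in the main process, before the DataLoader spawns its workers. The dataset, batch size, and worker count below are placeholders, not code from this repo.

```python
import torch
import torch.multiprocessing
from torch.utils.data import DataLoader, TensorDataset

# Must be set in the main process, before any DataLoader workers exist.
torch.multiprocessing.set_sharing_strategy('file_system')

# Placeholder dataset standing in for the real tokenized dataset.
dataset = TensorDataset(torch.randn(1000, 16))

# num_workers > 0 is what triggers tensor sharing between processes.
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for (batch,) in loader:
    pass  # training step would go here
```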
Yeah, I did try with 1 worker and had the same errors. (Can't use 0, because this requires at least one worker 😄)
I have removed the multiprocess tokenization from my code and it works fine.
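
I don't know exactly what the multiprocess tokenization looked like, but the single-process fallback is just a plain loop. A sketch with hypothetical names (`tokenize` and `captions` are stand-ins, not identifiers from this repo):

```python
# Multiprocess version (prone to the shared-memory errors above):
#     with multiprocessing.Pool(4) as pool:
#         tokens = pool.map(tokenize, captions)

def tokenize(caption: str) -> list[str]:
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return caption.lower().split()

captions = ["A dog runs.", "Two cats sleep."]

# Single-process fallback: same result, no worker processes involved.
tokens = [tokenize(c) for c in captions]
```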
Just to let you know, it doesn't happen in the starting iterations or epochs; I'd guess it was after 3-5 epochs.
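
Since the failure only shows up after a few epochs, one way to check for a descriptor leak is to log the process's open-fd count once per epoch and see whether it grows without bound. A minimal, Linux-only sketch (the training loop is a placeholder):

```python
import os

def open_fd_count() -> int:
    """Number of file descriptors currently open in this process (Linux only)."""
    return len(os.listdir('/proc/self/fd'))

for epoch in range(10):  # placeholder training loop
    # ... run one epoch of training here ...
    print(f"epoch {epoch}: {open_fd_count()} open fds")
```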
I think I’m hitting this.
In my setup I’m doing independent runs in parallel threads (not processes, since I’m using LevelDB and it does not support multiprocessing). Sometimes it breaks with the error:
```
RuntimeError: received 0 items of ancdata
```
Even though I’m using the workaround suggested here: https://github.com/pytorch/pytorch/issues/973#issuecomment-346405667
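
It may be worth reading both settings back at runtime to confirm the workaround actually took effect in the process that fails; nothing in this sketch is specific to the repo:

```python
import resource
import torch.multiprocessing

torch.multiprocessing.set_sharing_strategy('file_system')

# The soft limit cannot exceed the hard limit, so clamp it.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(8192, hard), hard))

# Read both settings back to confirm they are active.
print("sharing strategy:", torch.multiprocessing.get_sharing_strategy())
print("RLIMIT_NOFILE soft/hard:", resource.getrlimit(resource.RLIMIT_NOFILE))
```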