
Multiprocessing with any DataPipe writing to local file


🐛 Describe the bug

Extra care is needed for any DataPipe that writes to the local file system when DataLoader2 triggers multiprocessing. If the file name on the local file system is the same across multiple processes, there is a race condition. This was discovered when the TorchText team used on_disk_cache to cache files. DataLoader needs to know that such a DataPipe must be sharded under multiprocessing, or it must force that DataPipe to run in a single process.

As a workaround, users have to download the file to the local file system themselves, to avoid writing from within a DataPipe.
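One way to sidestep the race described above, without any cross-process lock, is to publish the cache file atomically. Below is a minimal sketch (the helper `write_cache_once` is hypothetical and not part of torchdata): each worker writes to a private temp file and installs it with `os.replace`, so readers never observe a half-written file even if several workers race on the same path.

```python
import os
import tempfile


def write_cache_once(path: str, data: bytes) -> None:
    """Write `path` at most once, even if several worker processes race.

    Hypothetical helper for illustration: writes to a private temp file
    in the same directory, then publishes it atomically, so a reader
    never sees a partially written cache file.
    """
    if os.path.exists(path):
        # Another worker already finished; nothing to do.
        return
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic rename on both POSIX and Windows
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up if the write or rename failed
```

If two workers both pass the existence check, each writes its own temp file and one atomic rename wins; the file content is still consistent. This does not deduplicate the work, it only makes the result safe.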

Versions

main branch

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (11 by maintainers)

Top GitHub Comments

1 reaction
NivekT commented, Apr 12, 2022

A mutex shared across processes would be sufficient in my mind. But, I don’t know if there is a cross-platform solution for it.

@hudeven is looking into a solution in diff D35459528. Basically, it uses iopath's file-locking mechanism, which internally depends on portalocker. Would this be a viable cross-platform solution?

I think if we can incorporate that into the IoPathSaver DataPipe, it should be a viable cross-platform solution, but it would mean users have to install iopath and portalocker if they wish to lock files across processes.
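The portalocker approach discussed above boils down to an advisory file lock shared across processes. As a minimal POSIX-only stand-in (portalocker wraps this mechanism on Unix and uses a different API on Windows), the standard library's `fcntl.flock` shows the idea; the helper name here is made up for illustration:

```python
import fcntl
import os


def exclusive_lock_conflicts(lock_path: str) -> bool:
    """Return True if a second non-blocking LOCK_EX attempt fails while
    the first descriptor holds the lock.

    flock() locks belong to the open file description, so two separate
    open() calls conflict with each other exactly as two separate
    processes would -- which is what makes this usable as a
    cross-process mutex around a shared cache file.
    """
    fd1 = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    fd2 = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    fcntl.flock(fd1, fcntl.LOCK_EX)  # first holder takes the lock
    try:
        # Second holder tries without blocking; EWOULDBLOCK means the
        # lock is doing its job.
        fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return False
    except BlockingIOError:
        return True
    finally:
        fcntl.flock(fd1, fcntl.LOCK_UN)
        os.close(fd1)
        os.close(fd2)
```

In a real DataPipe, the writer would block on `LOCK_EX` instead of probing with `LOCK_NB`, write the file, then unlock; this is roughly what portalocker provides in a cross-platform way.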

1 reaction
NicolasHug commented, Apr 12, 2022

Thanks for the ping Parmeet. I haven’t encountered this issue thus far, because torchvision datasets do not write anything to disk.


Top Results From Across the Web

Python multiprocessing safely writing to a file - Stack Overflow
@GP89 mentioned a good solution. Use a queue to send the writing tasks to a dedicated process that has sole write access to...
Multiprocessing in Python | Set 2 (Communication between ...
In multiprocessing, any newly created process will do following: run independently; have their own memory space.
multithreading, multiprocessing and parallel python ...
We will generate some data using one of the python files makedata.py by importing it in ipython. import makedata data = makedata.data() data....
Python Multiprocessing with output to file | by Bk Lim - Medium
Recently I encountered a scenario where I need to write the parallelized results into an output file, and the direct approach of adding...
multiprocessing.shared_memory — Shared memory for direct ...
As a resource for sharing data across processes, shared memory blocks may outlive the original process that created them. When one process no...
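The first Stack Overflow result above points at a third option that needs no file lock at all: route every write through a queue to a single dedicated writer process, which is the only process that ever opens the output file. A minimal sketch (the function and names are hypothetical, not from torchdata):

```python
import multiprocessing as mp


def _writer(path, q):
    # Sole owner of the output file: drain the queue until the
    # None sentinel arrives.
    with open(path, "w") as f:
        for line in iter(q.get, None):
            f.write(line + "\n")


def _worker(q, i):
    # Workers never touch the file; they only enqueue results.
    q.put(f"result-{i}")


def run_pipeline(path, n_workers=4):
    """Fan results from n workers into one writer process."""
    q = mp.Queue()
    w = mp.Process(target=_writer, args=(path, q))
    w.start()
    workers = [mp.Process(target=_worker, args=(q, i)) for i in range(n_workers)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    q.put(None)  # sentinel shuts the writer down
    w.join()
```

Because only the writer process holds the file handle, there is no race to serialize, at the cost of funneling all output through one process.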
