# Memory leak when iterating a Dataset
See original GitHub issue.

## Describe the bug

It seems that memory never gets freed after iterating a Dataset (using `.map()` or a simple `for` loop).
## Steps to reproduce the bug
```python
import gc
import logging
import time

import pyarrow
from datasets import load_dataset
from tqdm import trange
import os, psutil

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

process = psutil.Process(os.getpid())
print(process.memory_info().rss)  # output: 633507840 bytes

corpus = load_dataset("BeIR/msmarco", 'corpus', keep_in_memory=False, streaming=False)['corpus']  # or "BeIR/trec-covid" for a smaller dataset
print(process.memory_info().rss)  # output: 698601472 bytes

logger.info("Applying method to all examples in all splits")
for i in trange(0, len(corpus), 1000):
    batch = corpus[i:i+1000]
    data = pyarrow.total_allocated_bytes()
    if data > 0:
        logger.info(f"{i}/{len(corpus)}: {data}")

print(process.memory_info().rss)  # output: 3788247040 bytes
del batch
gc.collect()
print(process.memory_info().rss)  # output: 3788247040 bytes

logger.info("Done...")
time.sleep(100)
```
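The description above also mentions `.map()`. For completeness, a minimal hypothetical sketch of that variant (not part of the original report; the exact numbers will differ):

```python
# Hypothetical .map() variant of the loop above: iterate the same corpus
# through an identity map in batches and check resident memory afterwards.
corpus = corpus.map(lambda batch: batch, batched=True, batch_size=1000)
print(process.memory_info().rss)  # RSS is reported to stay elevated here as well
```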
## Expected results
Limited memory usage, and memory to be freed after processing
## Actual results
Memory leak
You can see how the memory allocation keeps increasing until it reaches a steady state when we hit the `time.sleep(100)`, which showcases that even the garbage collector couldn't free the allocated memory.
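Note that RSS counts pages the kernel has mapped from the on-disk Arrow cache file as well as Python-heap allocations, so a growing RSS does not by itself prove a leak. One hedged way to separate the two (not from the original report) is to compare RSS against USS via psutil's `memory_full_info()` and against PyArrow's own allocator counter:

```python
import os

import psutil
import pyarrow

proc = psutil.Process(os.getpid())
mem = proc.memory_full_info()  # slower than memory_info(); reads /proc/<pid>/smaps on Linux

# RSS includes shared, file-backed (memory-mapped) pages; USS counts only
# memory private to this process. A large RSS with a small USS and a
# near-zero Arrow allocation counter points at mmap'd cache pages rather
# than a Python-level leak.
print(f"rss={mem.rss} uss={mem.uss} arrow={pyarrow.total_allocated_bytes()}")
```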
## Environment info

- `datasets` version: 2.3.2
- Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.31
- Python version: 3.9.7
- PyArrow version: 8.0.0
- Pandas version: 1.4.2
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
## Top GitHub Comments
Amazing job! Thanks for taking time to debug this 🤗
On my side, I tried to do some more research as well, but to no avail: https://github.com/huggingface/datasets/issues/4883#issuecomment-1243415957
@NouamaneTazi, please check my analysis here: https://github.com/huggingface/datasets/issues/4883#issuecomment-1242599722. If you agree with my research, this issue can be closed as well.
I also made a suggestion on how to proceed to hunt for a real leak here: https://github.com/huggingface/datasets/issues/4883#issuecomment-1242600626
You may find this one useful as well: https://github.com/huggingface/datasets/issues/4883#issuecomment-1242597966
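For readers who want to reproduce that kind of check themselves, here is a rough sketch (assuming the `corpus` object and batching loop from the script above; this is an illustration, not the exact procedure from the linked comments): track Python-heap and Arrow-allocator usage instead of process RSS.

```python
import gc
import tracemalloc

import pyarrow

tracemalloc.start()
arrow_baseline = pyarrow.total_allocated_bytes()

# Same batching pattern as the reproduction script above.
for i in range(0, len(corpus), 1000):
    batch = corpus[i:i+1000]

del batch
gc.collect()

heap_now, heap_peak = tracemalloc.get_traced_memory()
# If the Python heap and Arrow's allocator both return to (roughly) their
# baselines while RSS stays high, the growth is most likely memory-mapped
# file pages rather than a real leak.
print(f"python_heap={heap_now} bytes (peak {heap_peak}), "
      f"arrow_delta={pyarrow.total_allocated_bytes() - arrow_baseline} bytes")
```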