Memory leak when iterating a Dataset

Describe the bug

It seems that memory never gets freed after iterating a Dataset (using .map() or a simple for loop).

Steps to reproduce the bug

import gc
import logging
import time
import pyarrow
from datasets import load_dataset
from tqdm import trange
import os, psutil

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
process = psutil.Process(os.getpid())

print(process.memory_info().rss)  # output: 633507840 bytes

corpus = load_dataset("BeIR/msmarco", 'corpus', keep_in_memory=False, streaming=False)['corpus'] # or "BeIR/trec-covid" for a smaller dataset

print(process.memory_info().rss)  # output: 698601472 bytes

logger.info("Applying method to all examples in all splits")
for i in trange(0, len(corpus), 1000):
    batch = corpus[i:i+1000]
    data = pyarrow.total_allocated_bytes()
    if data > 0:
        logger.info(f"{i}/{len(corpus)}: {data}")

print(process.memory_info().rss)  # output: 3788247040 bytes

del batch
gc.collect()

print(process.memory_info().rss)  # output: 3788247040 bytes

logger.info("Done...")
time.sleep(100)

Expected results

Limited memory usage, and memory to be freed after processing

Actual results

[Figure: memory leak test]

The memory allocation keeps increasing until it reaches a steady state at the time.sleep(100) call, which shows that even the garbage collector couldn’t free the allocated memory.
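One way to narrow down where growth like this comes from is to compare Python-level allocations before and after the loop with the standard library's tracemalloc. The sketch below is not part of the original report: `measure_python_allocations` and `iterate_in_batches` are hypothetical names, and a plain list stands in for the dataset. Note that tracemalloc only sees allocations made through Python's allocators, so memory-mapped files or native Arrow buffers would typically not show up here. A small "retained" value alongside a growing RSS would suggest looking outside Python objects.

```python
import gc
import tracemalloc


def measure_python_allocations(workload):
    """Run workload() and return (retained_bytes, peak_bytes) as seen by tracemalloc.

    Hypothetical helper: reports Python-level allocations only, not process RSS.
    """
    gc.collect()
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        workload()
        gc.collect()  # drop anything that is only kept alive by reference cycles
        after, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before, peak - before


def iterate_in_batches(data, batch_size=1000):
    # Stand-in for the slicing loop in the report; `data` is any sliceable sequence.
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        del batch


retained, peak = measure_python_allocations(
    lambda: iterate_in_batches(list(range(100_000)))
)
print(f"retained after loop: {retained} bytes, peak during loop: {peak} bytes")
```

To apply this to the reproduction script, the lambda would wrap the slicing loop over `corpus` instead of the stand-in list.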

Environment info

  • datasets version: 2.3.2
  • Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.31
  • Python version: 3.9.7
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
NouamaneTazi commented, Sep 12, 2022

Amazing job! Thanks for taking time to debug this 🤗

On my side, I tried to do some more research as well, but to no avail. https://github.com/huggingface/datasets/issues/4883#issuecomment-1243415957

1 reaction
stas00 commented, Sep 10, 2022

@NouamaneTazi, please check my analysis here: https://github.com/huggingface/datasets/issues/4883#issuecomment-1242599722. If you agree with my research, this issue can be closed as well.

I also made a suggestion on how to proceed to hunt for a real leak here: https://github.com/huggingface/datasets/issues/4883#issuecomment-1242600626

You may find this one useful as well: https://github.com/huggingface/datasets/issues/4883#issuecomment-1242597966
