With dataloader RSS memory consumed by HF datasets monotonically increases
See original GitHub issue
Describe the bug
When HF datasets is used in conjunction with a PyTorch DataLoader, the RSS memory of the process keeps increasing when it should stay constant.
Steps to reproduce the bug
Run and observe the output of this snippet which logs RSS memory.
import psutil
import os
from transformers import BertTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
BATCH_SIZE = 32
NUM_TRIES = 10
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
def transform(x):
    x.update(tokenizer(x["text"], return_tensors="pt", max_length=64, padding="max_length", truncation=True))
    x.pop("text")
    x.pop("label")
    return x
dataset = load_dataset("imdb", split="train")
dataset.set_transform(transform)
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
count = 0
while count < NUM_TRIES:
    for idx, batch in enumerate(train_loader):
        mem_after = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
        print(count, idx, mem_after - mem_before)
    count += 1
Expected results
Memory should not increase after initial setup and loading of the dataset
Actual results
Memory continuously increases as can be seen in the log.
Environment info
- datasets version: 2.3.2
- Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-glibc2.10
- Python version: 3.8.13
- PyArrow version: 7.0.0
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
There is no leak, just badly communicated Linux RSS memory usage stats
Next, let's revisit @rwightman's suggestion that there is actually no leak.
After all, we are using mmap, which will keep as much of the file resident in RAM as it can and page it out when memory runs low; i.e. mmap is only fast if you have a lot of CPU RAM.
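To see that effect in isolation, here is a small hypothetical illustration (not from the issue): merely touching the pages of any memory-mapped file makes them count towards RSS, even though the kernel can reclaim them at any moment without swapping anything out.

import mmap
import os
import psutil

proc = psutil.Process(os.getpid())
print(f"RSS before: {proc.memory_info().rss / 2**20:.1f} MiB")

with open("some_large_file.arrow", "rb") as f:  # any large local file will do
    with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
        # read one byte per page so every page becomes resident
        total = sum(mm[i] for i in range(0, len(mm), mmap.PAGESIZE))
        print(f"RSS after touching the mapping: {proc.memory_info().rss / 2**20:.1f} MiB")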
So let’s do it:
Memory mapping OOM test
We first quickly start a cgroups-controlled shell which will instantly kill any program that consumes more than 1GB of memory:
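One possible way to get such a shell (a sketch assuming systemd with cgroups v2 and memory-controller delegation to the user session, not necessarily the exact setup used here) is to wrap systemd-run:

import subprocess

# Launch an interactive bash inside a transient scope with a hard 1GB memory cap.
subprocess.run([
    "systemd-run", "--user", "--scope",
    "-p", "MemoryMax=1G",      # anything in this scope that exceeds 1GB gets OOM-killed
    "-p", "MemorySwapMax=0",   # don't let swap hide the overage
    "bash",
])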
Let’s check that it indeed does so. Let’s change @lhoestq’s script to allocate a 10GB arrow file:
Oops, that didn't work: we tried to allocate 10GB when only 1GB is allowed. This is exactly what we want!
Let's do a sanity check: can we allocate 0.1GB?
Yes. So the limited shell does the right thing: it lets us allocate less than 1GB of RSS RAM.
Next, let's go back to @lhoestq's script, but with a 10GB arrow file.
We change his repro script (https://github.com/huggingface/datasets/issues/4883#issuecomment-1242034985) to generate a 50x larger file.
We first have to run it in a normal, unlimited shell so that we don't get killed (as the script allocates 10GB).
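The modified repro isn't shown here either; a rough stand-in (a sketch using pyarrow directly rather than the datasets API, with a hypothetical big.arrow path) that first builds a ~10GB Arrow file while holding the whole column in RAM, then memory-maps it and scans it while logging RSS, could look like this:

import os
import psutil
import numpy as np
import pyarrow as pa

PATH = "big.arrow"
N_GB = 10

if not os.path.exists(PATH):
    # generation step: holds ~10GB in RAM, hence the unlimited shell
    table = pa.table({"col": np.ones(N_GB * 2**30, dtype=np.uint8)})
    with pa.OSFile(PATH, "wb") as sink:
        writer = pa.ipc.new_stream(sink, table.schema)
        writer.write_table(table, max_chunksize=100 * 2**20)  # ~100MB record batches
        writer.close()

# reading step: memory-mapped, run this part in the 1GB-limited shell
proc = psutil.Process(os.getpid())
with pa.memory_map(PATH, "r") as source:
    for i, batch in enumerate(pa.ipc.open_stream(source)):
        batch.column(0).to_numpy().sum()  # touch the data so its pages become resident
        if i % 10 == 0:
            print(f"batch {i}: RSS = {proc.memory_info().rss / 2**30:.2f} GB", flush=True)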
let’s run the script now in the 1GB-limited shell while running a monitor:
That way we have two sources of RSS info, just in case.
But wait, it reported 10GB RSS both in htop and in our log! So it never actually allocated 10GB, otherwise it would have been killed.
This tells us that there is no leak whatsoever; it is just a really awkward situation where mmapped memory is reported as part of RSS, which it arguably shouldn't be, leaving us with no straightforward way to measure real memory usage.
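For a finer-grained view (a hypothetical sketch, Linux-only, assuming a kernel that exposes /proc/<pid>/smaps_rollup), the rollup file breaks RSS down so that anonymous memory can be told apart from resident file-backed pages such as mmapped arrow files:

def memory_breakdown(pid="self"):
    """Return the fields of /proc/<pid>/smaps_rollup in MiB."""
    fields = {}
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and parts[2] == "kB":
                fields[parts[0].rstrip(":")] = int(parts[1]) / 1024
    return fields

mem = memory_breakdown()
print(f"Rss:       {mem['Rss']:.1f} MiB  (includes resident mmapped file pages)")
print(f"Anonymous: {mem['Anonymous']:.1f} MiB  (heap and other anonymous memory only)")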
I also attached the script with all the different things I have tried in it, so it should be easy to turn them on/off if you want to reproduce any of my findings: pyar.txt (just rename it to pyra.py, as GH doesn't allow attaching scripts…).

(I have to remember to exit that special mem-limited shell, or else I won't be able to do anything serious there.)
Thank you for these suggestions, Ross.
The problem is that most of the time we use a bunch of non-Python libs with Python bindings, so besides Python itself one has to deal with the not-quite-controllable allocation strategies of those other components as well. So it's a super-tricky world.
Good suggestion on memory fragmentation, which could definitely be one of the sources of ever-growing RSS. PyTorch's memory management utils are mostly quite excellent, and fragmentation is one of the main issues there. Projects like DeepSpeed try to solve it by pre-allocating memory themselves and then managing it tightly to avoid fragmentation, which seems to work quite well.
BTW, I'm not sure if you have seen this tool I developed some years back to automatically track and report CPU and GPU memory usage in Jupyter notebooks: https://github.com/stas00/ipyexperiments. I found it quite useful for detecting memory leaks; of course it's the same RSS for CPU, but it's automated so that each cell reports the delta. One other tricky thing it provides is measuring peak CPU memory, since there are often temporary spikes that lead to OOMs.
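For reference, peak RSS can also be read with the standard library alone on Linux (just an illustration, not necessarily what ipyexperiments uses internally): ru_maxrss is the process's resident high-water mark.

import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kilobytes on Linux
print(f"peak RSS so far: {peak_kb / 1024:.1f} MiB")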