With dataloader RSS memory consumed by HF datasets monotonically increases
See original GitHub issue
Describe the bug
When HF datasets is used in conjunction with a PyTorch DataLoader, the RSS memory of the process keeps increasing when it should stay constant.
Steps to reproduce the bug
Run and observe the output of this snippet which logs RSS memory.
import psutil
import os
from transformers import BertTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
BATCH_SIZE = 32
NUM_TRIES = 10
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
def transform(x):
    x.update(tokenizer(x["text"], return_tensors="pt", max_length=64, padding="max_length", truncation=True))
    x.pop("text")
    x.pop("label")
    return x
dataset = load_dataset("imdb", split="train")
dataset.set_transform(transform)
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
count = 0
while count < NUM_TRIES:
    for idx, batch in enumerate(train_loader):
        mem_after = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
        print(count, idx, mem_after - mem_before)
    count += 1
Expected results
Memory should not increase after initial setup and loading of the dataset
Actual results
Memory continuously increases as can be seen in the log.
Environment info
- datasets version: 2.3.2
- Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-glibc2.10
- Python version: 3.8.13
- PyArrow version: 7.0.0
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
There is no leak, just badly communicated Linux RSS memory usage stats
Next, let's revisit @rwightman's suggestion that there is actually no leak.
After all, we are using mmap, which will keep as much of the file resident in RAM as it can and page it out when memory runs low; i.e. mmap is only fast if you have a lot of CPU RAM.
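To see that effect in isolation, here is a small hypothetical illustration (not from the issue): merely touching the pages of any memory-mapped file makes them count towards RSS, even though the kernel can reclaim them at any moment without swapping anything out.

import mmap
import os
import psutil

proc = psutil.Process(os.getpid())
print(f"RSS before: {proc.memory_info().rss / 2**20:.1f} MiB")

with open("some_large_file.arrow", "rb") as f:  # any large local file will do
    with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
        # read one byte per page so every page becomes resident
        total = sum(mm[i] for i in range(0, len(mm), mmap.PAGESIZE))
        print(f"RSS after touching the mapping: {proc.memory_info().rss / 2**20:.1f} MiB")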
So let’s do it:
Memory mapping OOM test
We first quickly start a cgroups-controlled shell which will instantly kill any program that consumes more than 1GB of memory:
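One possible way to get such a shell (a sketch assuming systemd with cgroups v2 and memory-controller delegation to the user session, not necessarily the exact setup used here) is to wrap systemd-run:

import subprocess

# Launch an interactive bash inside a transient scope with a hard 1GB memory cap.
subprocess.run([
    "systemd-run", "--user", "--scope",
    "-p", "MemoryMax=1G",      # anything in this scope that exceeds 1GB gets OOM-killed
    "-p", "MemorySwapMax=0",   # don't let swap hide the overage
    "bash",
])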
Let’s check that it indeed does so. Let’s change @lhoestq’s script to allocate a 10GB arrow file:
Oops, that didn't work: we tried to allocate 10GB when only 1GB is allowed. This is exactly what we want!
Let's do a sanity check: can we allocate 0.1GB?
Yes. So the limited shell does the right thing: it lets us allocate less than 1GB of RSS RAM.
Next, let's go back to @lhoestq's script, but with a 10GB arrow file.
We change his repro script (https://github.com/huggingface/datasets/issues/4883#issuecomment-1242034985) to generate a 50x larger file.
We first have to run it in a normal, unlimited shell so that we don't get killed (as the script allocates 10GB).
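The modified repro isn't shown here either; a rough stand-in (a sketch using pyarrow directly rather than the datasets API, with a hypothetical big.arrow path) that first builds a ~10GB Arrow file while holding the whole column in RAM, then memory-maps it and scans it while logging RSS, could look like this:

import os
import psutil
import numpy as np
import pyarrow as pa

PATH = "big.arrow"
N_GB = 10

if not os.path.exists(PATH):
    # generation step: holds ~10GB in RAM, hence the unlimited shell
    table = pa.table({"col": np.ones(N_GB * 2**30, dtype=np.uint8)})
    with pa.OSFile(PATH, "wb") as sink:
        writer = pa.ipc.new_stream(sink, table.schema)
        writer.write_table(table, max_chunksize=100 * 2**20)  # ~100MB record batches
        writer.close()

# reading step: memory-mapped, run this part in the 1GB-limited shell
proc = psutil.Process(os.getpid())
with pa.memory_map(PATH, "r") as source:
    for i, batch in enumerate(pa.ipc.open_stream(source)):
        batch.column(0).to_numpy().sum()  # touch the data so its pages become resident
        if i % 10 == 0:
            print(f"batch {i}: RSS = {proc.memory_info().rss / 2**30:.2f} GB", flush=True)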
let’s run the script now in the 1GB-limited shell while running a monitor:
That way we have two sources of RSS info, just in case.
But wait, it reported 10GB RSS both in htop and in our log! So it never actually allocated 10GB, otherwise it would have been killed.
This tells us that there is no leak whatsoever; it is just a really awkward situation where mmapped memory is reported as part of RSS, which it arguably shouldn't be, leaving us with no straightforward way to measure real memory usage.
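For a finer-grained view (a hypothetical sketch, Linux-only, assuming a kernel that exposes /proc/<pid>/smaps_rollup), the rollup file breaks RSS down so that anonymous memory can be told apart from resident file-backed pages such as mmapped arrow files:

def memory_breakdown(pid="self"):
    """Return the fields of /proc/<pid>/smaps_rollup in MiB."""
    fields = {}
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and parts[2] == "kB":
                fields[parts[0].rstrip(":")] = int(parts[1]) / 1024
    return fields

mem = memory_breakdown()
print(f"Rss:       {mem['Rss']:.1f} MiB  (includes resident mmapped file pages)")
print(f"Anonymous: {mem['Anonymous']:.1f} MiB  (heap and other anonymous memory only)")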
I also attached the script with all the different things I have tried in it, so it should be easy to turn them on/off if you want to reproduce any of my findings: pyar.txt (just rename it to pyra.py, as GH doesn't allow attaching scripts…).

(I have to remember to exit that special mem-limited shell, or else I won't be able to do anything serious there.)
Thank you for these suggestions, Ross.
The problem is that most of the time we use a bunch of non-Python libs with Python bindings, so besides Python itself one has to deal with the not-quite-controllable allocation strategies of those other components as well. So it's a super-tricky world.
Good suggestion on memory fragmentation, which could definitely be one of the sources of ever-growing RSS. PyTorch's memory management utils are mostly quite excellent, and fragmentation is one of the main issues there. Projects like DeepSpeed try to solve it by pre-allocating memory themselves and then managing it tightly to avoid fragmentation, which seems to work quite well.
BTW, I'm not sure if you have seen this tool I developed some years back to automatically track and report CPU and GPU memory usage in Jupyter notebooks: https://github.com/stas00/ipyexperiments. I found it quite useful for detecting memory leaks; of course it's the same RSS for CPU, but it's automated so that each cell reports the delta. One other tricky thing it provides is measuring peak CPU memory, since there are often temporary spikes that lead to OOMs.
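For reference, peak RSS can also be read with the standard library alone on Linux (just an illustration, not necessarily what ipyexperiments uses internally): ru_maxrss is the process's resident high-water mark.

import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kilobytes on Linux
print(f"peak RSS so far: {peak_kb / 1024:.1f} MiB")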