
Creating dataset consumes too much memory


Moving this issue from https://github.com/huggingface/datasets/pull/722 here, because it seems like a general issue.

Given the following dataset script, where each example stores a sequence of 260x210x3 images (max sequence length 400):

    import csv
    import os

    import numpy as np
    from PIL import Image


    def _generate_examples(self, base_path, split):
        """Yields examples as (key, example) tuples."""
        filepath = os.path.join(base_path, "annotations", "manual", "PHOENIX-2014-T." + split + ".corpus.csv")
        images_path = os.path.join(base_path, "features", "fullFrame-210x260px", split)

        with open(filepath, "r", encoding="utf-8") as f:
            data = csv.DictReader(f, delimiter="|", quoting=csv.QUOTE_NONE)
            for row in data:
                # The "video" field ends with a 7-character suffix that is
                # stripped to obtain the directory holding the frames.
                frames_path = os.path.join(images_path, row["video"])[:-7]

                # Load every frame of the video as a uint8 array.
                np_frames = []
                for frame_name in os.listdir(frames_path):
                    frame_path = os.path.join(frames_path, frame_name)
                    im = Image.open(frame_path)
                    np_frames.append(np.asarray(im))
                    im.close()

                yield row["name"], {"video": np_frames}

The dataset creation process goes out of memory on a machine with 500GB RAM. I was under the impression that the generator exists exactly for this, to avoid loading the full dataset into memory at once.

However, even if you wanted the entire dataset in memory, the worst case would be 260x210x3 bytes per frame x 400 frames (max length) x 7000 samples (uint8) = 458.64 GB. So I'm not sure why it's taking more than 500GB.
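As a quick sanity check of that bound (all numbers taken from above):

    # Worst-case raw size if every sample had the maximum length.
    frame_bytes = 260 * 210 * 3      # one uint8 frame: 163,800 bytes
    max_frames = 400                 # maximum sequence length
    num_samples = 7000               # number of samples in the dataset
    total = frame_bytes * max_frames * num_samples
    print(total / 1e9)               # 458.64 (GB)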

And the dataset creation fails after 170 examples on a machine with 120GB RAM, and after 672 examples on a machine with 500GB RAM.


Info that might help:

Iterating over examples is extremely slow (screenshot omitted). If I perform this iteration in my own custom loop (without saving to file), it runs at 8-9 examples/sec.

And you can see that at this point it is using 94% of the memory (screenshot omitted).

And it is only using one CPU core, which is probably why it's so slow (screenshot omitted).

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 20 (18 by maintainers)

Top GitHub Comments

2 reactions
lhoestq commented, Oct 29, 2020

Ok, found the issue. This is because the batch size used by the writer is set to 10,000 elements by default, so it would load your full dataset in memory (the writer has a buffer that flushes only after each batch). Moreover, to write in Apache Arrow we have to use Python objects, so what's stored inside the ArrowWriter's buffer is actually Python integers (32 bits).

Lowering the batch size to 10 should do the job.

I will add a flag to the DatasetBuilder class of dataset scripts, so that we can customize the batch size.
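For reference, a minimal sketch of what that might look like in a dataset script, assuming the flag is exposed as a DEFAULT_WRITER_BATCH_SIZE class attribute on GeneratorBasedBuilder (the class name and attribute placement here are illustrative, not taken from the thread):

    import datasets


    class Phoenix2014T(datasets.GeneratorBasedBuilder):
        # Each example can hold 400 frames of 260x210x3 uint8 data
        # (~65 MB raw), so flush the Arrow writer every 10 examples
        # instead of the default 10,000 to keep the buffer small.
        DEFAULT_WRITER_BATCH_SIZE = 10

        def _info(self):
            ...

        def _split_generators(self, dl_manager):
            ...

        def _generate_examples(self, base_path, split):
            # Same generator as shown above.
            ...

With a batch size of 10, the writer buffers at most ~10 x 65 MB of raw frame data (plus the Python-object overhead) before flushing to disk, instead of the entire dataset.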

1 reaction
lhoestq commented, Nov 17, 2020

Yes, you did it right. Did you rebase to include the changes of #828?

EDIT: looks like you merged master into the PR. Not sure why you still have an issue then, I will investigate.


