Creating dataset consumes too much memory
Moving this issue from https://github.com/huggingface/datasets/pull/722 here, because it seems like a general issue.
Given the following dataset script, where each example stores a sequence of up to 400 images of size 260x210x3:
import csv
import os

import numpy as np
from PIL import Image


def _generate_examples(self, base_path, split):
    """Yields examples."""
    filepath = os.path.join(base_path, "annotations", "manual", "PHOENIX-2014-T." + split + ".corpus.csv")
    images_path = os.path.join(base_path, "features", "fullFrame-210x260px", split)
    with open(filepath, "r", encoding="utf-8") as f:
        data = csv.DictReader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in data:
            # Drop the last 7 characters of the "video" entry to get the directory of frame images.
            frames_path = os.path.join(images_path, row["video"])[:-7]
            np_frames = []
            # Load every frame of the video as a uint8 numpy array.
            for frame_name in os.listdir(frames_path):
                frame_path = os.path.join(frames_path, frame_name)
                im = Image.open(frame_path)
                np_frames.append(np.asarray(im))
                im.close()
            yield row["name"], {"video": np_frames}
The dataset creation process runs out of memory on a machine with 500 GB of RAM. I was under the impression that the generator is there precisely to avoid this kind of memory constraint.
However, even keeping the entire dataset in memory should take, in the worst case,

260 × 210 × 3 bytes per frame (uint8) × 400 frames (max length) × 7,000 samples = 458.64 GB

so I'm not sure why it needs more than 500 GB.
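For reference, the same worst-case arithmetic as a quick Python check (pure arithmetic, no datasets code involved):

bytes_per_frame = 260 * 210 * 3            # one 260x210 RGB frame stored as uint8
bytes_per_example = bytes_per_frame * 400  # worst-case sequence length
total_bytes = bytes_per_example * 7000     # number of samples
print(total_bytes / 1e9)                   # 458.64 (decimal GB)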
Dataset creation fails after 170 examples on a machine with 120 GB of RAM, and after 672 examples on a machine with 500 GB of RAM.
Info that might help:
- Iterating over the examples is extremely slow.
- If I perform this iteration in my own custom loop (without saving to file), it runs at 8-9 examples/sec; see the sketch after this list.
- At this point the process is using 94% of the machine's memory.
- It is only using one CPU core, which is probably why it is so slow.
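For concreteness, a hedged sketch of what that custom loop could look like; the base path is a placeholder, and _generate_examples is called directly as a plain function (passing None for self, which the method body never uses):

import time

base_path = "/path/to/PHOENIX-2014-T"  # placeholder, adjust to the local copy

start = time.time()
count = 0
# Iterate the generator directly, without feeding examples to the Arrow writer.
for key, example in _generate_examples(None, base_path, "train"):
    count += 1
    if count % 100 == 0:
        print(f"{count / (time.time() - start):.1f} examples/sec")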
Ok, found the issue. The batch size used by the writer is set to 10,000 elements by default, so it would load your full dataset into memory (the writer has a buffer that flushes only after each batch). Moreover, to write Apache Arrow we have to use Python objects, so what is stored inside the ArrowWriter's buffer is actually Python integers (32 bits) rather than packed uint8 values.
Lowering the batch size to 10 should do the job.
I will add a flag to the DatasetBuilder class of dataset scripts, so that we can customize the batch size.
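For illustration, a minimal sketch of what such a flag could look like in a dataset script, assuming it is exposed as a class attribute; the attribute name DEFAULT_WRITER_BATCH_SIZE and the builder class name below are assumptions for illustration, not confirmed API:

import datasets


class Phoenix2014T(datasets.GeneratorBasedBuilder):
    # Assumed flag: flush the ArrowWriter buffer every 10 examples instead of
    # the default 10,000, so large video examples don't accumulate in RAM.
    DEFAULT_WRITER_BATCH_SIZE = 10

    def _info(self):
        ...

    def _split_generators(self, dl_manager):
        ...

    def _generate_examples(self, base_path, split):
        # same generator as above
        ...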
Yes, you did it right. Did you rebase to include the changes from #828?
EDIT: it looks like you merged from master in the PR. Not sure why you still have an issue then; I will investigate.