Creating dataset consumes too much memory
Moving this issue from https://github.com/huggingface/datasets/pull/722 here, because it seems like a general issue.
Given the following dataset script, where each example stores a sequence of up to 400 images of size 260x210x3:
import csv
import os

import numpy as np
from PIL import Image


def _generate_examples(self, base_path, split):
    """Yields examples."""
    filepath = os.path.join(base_path, "annotations", "manual", "PHOENIX-2014-T." + split + ".corpus.csv")
    images_path = os.path.join(base_path, "features", "fullFrame-210x260px", split)
    with open(filepath, "r", encoding="utf-8") as f:
        data = csv.DictReader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in data:
            # Drop the last 7 characters of the "video" entry to get the directory of frame images.
            frames_path = os.path.join(images_path, row["video"])[:-7]
            np_frames = []
            # Load every frame of the video as a uint8 numpy array.
            for frame_name in os.listdir(frames_path):
                frame_path = os.path.join(frames_path, frame_name)
                im = Image.open(frame_path)
                np_frames.append(np.asarray(im))
                im.close()
            yield row["name"], {"video": np_frames}
The dataset creation process runs out of memory on a machine with 500 GB of RAM. I was under the impression that the generator is there precisely to avoid this kind of memory constraint.
However, even keeping the entire dataset in memory should take, in the worst case,

260 × 210 × 3 bytes per frame (uint8) × 400 frames (max length) × 7,000 samples = 458.64 GB

so I'm not sure why it needs more than 500 GB.
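For reference, the same worst-case arithmetic as a quick Python check (pure arithmetic, no datasets code involved):

bytes_per_frame = 260 * 210 * 3            # one 260x210 RGB frame stored as uint8
bytes_per_example = bytes_per_frame * 400  # worst-case sequence length
total_bytes = bytes_per_example * 7000     # number of samples
print(total_bytes / 1e9)                   # 458.64 (decimal GB)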
Dataset creation fails after 170 examples on a machine with 120 GB of RAM, and after 672 examples on a machine with 500 GB of RAM.
Info that might help:
- Iterating over the examples is extremely slow.
- If I perform this iteration in my own custom loop (without saving to file), it runs at 8-9 examples/sec; see the sketch after this list.
- At this point the process is using 94% of the machine's memory.
- It is only using one CPU core, which is probably why it is so slow.
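For concreteness, a hedged sketch of what that custom loop could look like; the base path is a placeholder, and _generate_examples is called directly as a plain function (passing None for self, which the method body never uses):

import time

base_path = "/path/to/PHOENIX-2014-T"  # placeholder, adjust to the local copy

start = time.time()
count = 0
# Iterate the generator directly, without feeding examples to the Arrow writer.
for key, example in _generate_examples(None, base_path, "train"):
    count += 1
    if count % 100 == 0:
        print(f"{count / (time.time() - start):.1f} examples/sec")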
Ok, found the issue. The batch size used by the writer is set to 10,000 elements by default, so it would load your full dataset into memory (the writer has a buffer that flushes only after each batch). Moreover, to write Apache Arrow we have to use Python objects, so what is stored inside the ArrowWriter's buffer is actually Python integers (32 bits) rather than packed uint8 values.
Lowering the batch size to 10 should do the job.
I will add a flag to the DatasetBuilder class of dataset scripts, so that we can customize the batch size.
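For illustration, a minimal sketch of what such a flag could look like in a dataset script, assuming it is exposed as a class attribute; the attribute name DEFAULT_WRITER_BATCH_SIZE and the builder class name below are assumptions for illustration, not confirmed API:

import datasets


class Phoenix2014T(datasets.GeneratorBasedBuilder):
    # Assumed flag: flush the ArrowWriter buffer every 10 examples instead of
    # the default 10,000, so large video examples don't accumulate in RAM.
    DEFAULT_WRITER_BATCH_SIZE = 10

    def _info(self):
        ...

    def _split_generators(self, dl_manager):
        ...

    def _generate_examples(self, base_path, split):
        # same generator as above
        ...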
Yes, you did it right. Did you rebase to include the changes from #828?
EDIT: it looks like you merged from master in the PR. Not sure why you still have an issue then; I will investigate.