concatenate_datasets loads all the data into memory

Describe the bug

When I concatenate two datasets (10 GB each), all of the data is loaded into memory instead of being written directly to disk.

Interestingly, this happens when saving the new dataset to disk or when concatenating it again.

Steps to reproduce the bug

from datasets import concatenate_datasets, load_from_disk

test_sampled_pro = load_from_disk("test_sampled_pro")
val_sampled_pro = load_from_disk("val_sampled_pro")

big_set = concatenate_datasets([test_sampled_pro, val_sampled_pro])

# Observed: the whole dataset is loaded into memory at this step
big_set.save_to_disk("big_set")

# Observed: the whole dataset is loaded into memory at this step as well
big_set = concatenate_datasets([big_set, val_sampled_pro])

Expected results

The data should be loaded into memory in batches and then saved directly to disk.
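To illustrate the expected behavior, here is a minimal sketch using only the Python standard library. It assumes plain binary files rather than Arrow-backed datasets, and `concatenate_in_batches` and `batch_bytes` are hypothetical names for this example, not part of the `datasets` API. It only shows the general pattern of reading one bounded batch at a time and writing it straight to disk, and makes no claim about how the library implements it.

```python
import os
import tempfile


def concatenate_in_batches(source_paths, dest_path, batch_bytes=1024 * 1024):
    """Append each source file to dest_path, one batch at a time.

    Hypothetical helper: at most `batch_bytes` of data is held in
    memory at once, regardless of the total size of the sources.
    Returns the total number of bytes written.
    """
    total = 0
    with open(dest_path, "wb") as dest:
        for path in source_paths:
            with open(path, "rb") as src:
                while True:
                    batch = src.read(batch_bytes)  # read one bounded batch
                    if not batch:
                        break  # end of this source file
                    dest.write(batch)  # write it to disk right away
                    total += len(batch)
    return total


if __name__ == "__main__":
    # Small demonstration with two temporary files.
    workdir = tempfile.mkdtemp()
    first = os.path.join(workdir, "first.bin")
    second = os.path.join(workdir, "second.bin")
    with open(first, "wb") as f:
        f.write(b"a" * 100)
    with open(second, "wb") as f:
        f.write(b"b" * 50)
    merged = os.path.join(workdir, "merged.bin")
    written = concatenate_in_batches([first, second], merged, batch_bytes=16)
    print(written, os.path.getsize(merged))
```

With this pattern, peak memory is bounded by the batch size rather than by the combined size of the inputs, which is the behavior the report expects from saving a concatenated dataset.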

Actual results

The entire dataset is loaded into memory and only then saved to disk.

Versions

- Datasets: 1.6.1
- Python: 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0]
- Platform: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.10

Issue Analytics

Created 2 years ago
Comments: 7 (5 by maintainers)

Top GitHub Comments

4 reactions

lhoestq commented, Apr 29, 2021

Hi @samsontmr @TaskManager91, the fix is on the master branch. Feel free to install datasets from source and let us know if you still have issues.

2 reactions

lhoestq commented, Apr 29, 2021

Thanks for the insights, @mariosasko! I’m working on a fix. Since this is a big issue, I’ll make a patch release as soon as it is fixed.
