concatenate_datasets loads all the data into memory
Describe the bug
When I try to concatenate two datasets (10 GB each), all of the data is loaded into memory instead of being written directly to disk.
Interestingly, this happens when saving the new dataset to disk or when concatenating it again, not on the first concatenation.
Steps to reproduce the bug
from datasets import concatenate_datasets, load_from_disk
test_sampled_pro = load_from_disk("test_sampled_pro")
val_sampled_pro = load_from_disk("val_sampled_pro")
big_set = concatenate_datasets([test_sampled_pro, val_sampled_pro])
# The next call loads the entire dataset into memory
big_set.save_to_disk("big_set")
# The next call also loads the entire dataset into memory
big_set = concatenate_datasets([big_set, val_sampled_pro])
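One rough way to confirm which step causes the memory growth is to compare the process's peak resident set size before and after each call. The sketch below uses only the standard library `resource` module (available on Linux and macOS, not on Windows). The helper names `peak_rss_mib` and `report_peak_growth` are hypothetical and not part of `datasets`. Peak RSS also counts resident pages of memory-mapped files, so treat the number as an indicator rather than an exact measure of what was copied into memory.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def report_peak_growth(label, fn, *args, **kwargs):
    """Call fn and print how much the peak RSS grew while it ran.

    Returns a (result, growth_in_mib) tuple. Because the peak never
    decreases, the growth is always zero or positive.
    """
    before = peak_rss_mib()
    result = fn(*args, **kwargs)
    growth = peak_rss_mib() - before
    print(f"{label}: peak RSS grew by {growth:.1f} MiB")
    return result, growth
```

With the script above, this could be used as `report_peak_growth("save", big_set.save_to_disk, "big_set")` or `report_peak_growth("concat", concatenate_datasets, [big_set, val_sampled_pro])`.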
Expected results
The data should be loaded into memory in batches and then saved directly to disk.
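To illustrate the batch-wise behavior I expect, here is a minimal standard-library sketch that concatenates several files on disk while holding at most one batch in memory. It is not how `datasets` works internally (which stores data as Arrow tables); it uses JSON Lines files as a stand-in, and the names `iter_batches` and `concatenate_to_disk` are hypothetical.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(path: str, batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size records from a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            # islice pulls at most batch_size lines from the open file.
            batch = [json.loads(line) for line in islice(f, batch_size)]
            if not batch:
                return
            yield batch


def concatenate_to_disk(sources: Iterable[str], dest: str, batch_size: int = 1000) -> int:
    """Write every record from sources to dest, one batch at a time.

    Records keep their order within and across the source files.
    Returns the number of records written.
    """
    written = 0
    with open(dest, "w", encoding="utf-8") as out:
        for path in sources:
            for batch in iter_batches(path, batch_size):
                out.writelines(json.dumps(record) + "\n" for record in batch)
                written += len(batch)
    return written
```

Memory use here is bounded by `batch_size` records rather than by the total size of the inputs, which is the property I would expect from concatenating and saving datasets.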
Actual results
The entire dataset is loaded into memory and only then saved to disk.
Versions
- Datasets: 1.6.1
- Python: 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
- Platform: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.10
Issue Analytics
- Created 2 years ago
- Comments: 7 (5 by maintainers)
Hi @samsontmr @TaskManager91, the fix is on the master branch. Feel free to install datasets from source and let us know if you still have issues.

Thanks for the insights @mariosasko! I'm working on a fix. Since this is a big issue, I'll make a patch release as soon as this is fixed.