concatenate_datasets loads all the data into memory

Describe the bug

When I try to concatenate two datasets (10 GB each), all of the data is loaded into memory instead of being written directly to disk.

Interestingly, this happens when saving the new dataset to disk or when concatenating it again.

Steps to reproduce the bug

from datasets import concatenate_datasets, load_from_disk

test_sampled_pro = load_from_disk("test_sampled_pro")
val_sampled_pro = load_from_disk("val_sampled_pro")

big_set = concatenate_datasets([test_sampled_pro, val_sampled_pro])

# Saving loads the whole concatenated dataset into memory
big_set.save_to_disk("big_set")

# Concatenating again also loads everything into memory
big_set = concatenate_datasets([big_set, val_sampled_pro])

Expected results

The data should be loaded into memory in batches and then saved directly to disk.
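
To make the expected behavior concrete, here is a minimal standard-library sketch of batch-wise concatenation. It is not how `datasets` implements `concatenate_datasets` or `save_to_disk`. The JSON Lines file format and the names `iter_batches` and `concatenate_to_disk` are assumptions chosen only for illustration.

```python
import json
import os
import tempfile
from itertools import islice


def iter_batches(path, batch_size):
    """Yield lists of at most `batch_size` records from a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            # islice pulls the next `batch_size` lines without reading the rest
            batch = [json.loads(line) for line in islice(f, batch_size)]
            if not batch:
                return
            yield batch


def concatenate_to_disk(sources, dest, batch_size=1000):
    """Append every record from `sources` to `dest`, one batch at a time.

    Returns (total_records, largest_batch) so the caller can check that no
    batch ever exceeded `batch_size` records.
    """
    total = 0
    largest = 0
    with open(dest, "w", encoding="utf-8") as out:
        for path in sources:
            for batch in iter_batches(path, batch_size):
                largest = max(largest, len(batch))
                out.writelines(json.dumps(rec) + "\n" for rec in batch)
                total += len(batch)
    return total, largest


if __name__ == "__main__":
    # Hypothetical toy inputs standing in for the two 10 GB datasets
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "first.jsonl")
        second = os.path.join(tmp, "second.jsonl")
        for path, start in ((first, 0), (second, 100)):
            with open(path, "w", encoding="utf-8") as f:
                for i in range(start, start + 5):
                    f.write(json.dumps({"id": i}) + "\n")
        merged = os.path.join(tmp, "merged.jsonl")
        print(concatenate_to_disk([first, second], merged, batch_size=2))
```

In this sketch, the number of records held in memory at once is bounded by `batch_size` rather than by the combined size of the inputs, which is the behavior the report asks for.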

Actual results

The entire dataset is loaded into memory and only then saved to disk.

Versions

- Datasets: 1.6.1
- Python: 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0]
- Platform: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.10

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

4 reactions
lhoestq commented, Apr 29, 2021

Hi @samsontmr @TaskManager91, the fix is on the master branch. Feel free to install datasets from source and let us know if you still have issues.

2 reactions
lhoestq commented, Apr 29, 2021

Thanks for the insights, @mariosasko! I’m working on a fix. Since this is a big issue, I’ll make a patch release as soon as it is fixed.
