
Slow dataloading with big datasets issue persists

See original GitHub issue

Hi,

I reported slow data fetching when the data is large (#2210) a couple of weeks ago, and @lhoestq referred me to the fix (#2122). However, the problem seems to persist. Here are the profiling results:

  1. Running with 60GB

| Action          | Mean duration (s) | Num calls | Total time (s) | Percentage (%) |
|-----------------|-------------------|-----------|----------------|----------------|
| Total           | -                 | -         | 517.96         | 100            |
| model_backward  | 0.26144           | 100       | 26.144         | 5.0475         |
| model_forward   | 0.11123           | 100       | 11.123         | 2.1474         |
| get_train_batch | 0.097121          | 100       | 9.7121         | 1.8751         |
  2. Running with 600GB, datasets==1.6.0

| Action          | Mean duration (s) | Num calls | Total time (s) | Percentage (%) |
|-----------------|-------------------|-----------|----------------|----------------|
| Total           | -                 | -         | 4563.2         | 100            |
| get_train_batch | 5.1279            | 100       | 512.79         | 11.237         |
| model_backward  | 4.8394            | 100       | 483.94         | 10.605         |
| model_forward   | 0.12162           | 100       | 12.162         | 0.26652        |

I see that get_train_batch lags when the data is large. Could this be related to a different issue? I would be happy to provide whatever information is needed to investigate.
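For reference, here is a minimal, hedged sketch of how a per-batch timing like get_train_batch can be reproduced outside a training framework. It assumes a PyTorch DataLoader wrapping a memory-mapped 🤗 dataset; the dataset path, batch size, and worker count are hypothetical.

```python
# Sketch only: time per-batch fetches from a DataLoader over a memory-mapped
# Hugging Face dataset, analogous to the get_train_batch row above.
# The dataset path, batch size, and worker count are hypothetical.
import time

from datasets import load_from_disk
from torch.utils.data import DataLoader

ds = load_from_disk("path/to/big_dataset")   # memory-maps the on-disk Arrow file(s)
ds = ds.with_format("torch")                 # return torch tensors when indexed

loader = DataLoader(ds, batch_size=32, num_workers=4)

prev = time.perf_counter()
for step, batch in enumerate(loader):
    now = time.perf_counter()
    print(f"batch {step}: {now - prev:.3f}s")
    if step == 99:                           # sample 100 batches, like the profile above
        break
    prev = time.perf_counter()
```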

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 51 (19 by maintainers)

Top GitHub Comments

4 reactions
lhoestq commented, Mar 17, 2022

> This has been a very interesting discussion to read. Are there any updates on it? I take it that the best option we have now is to shard our data into multiple datasets and concatenate them as shown above by @hwijeen.

If this solution proves helpful, we can add Arrow file sharding for all big datasets, directly integrated into load_dataset.

> I’m hoping that, by using the huggingface Dataset, the data loader will just index into the pyarrow table and the dataset won’t be loaded in full in each process (but presumably we have to pay the cost of load_data in each process so that the data loader can index into the table on that process)?

Yes your intuition is right 😃
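For readers following along, here is a minimal sketch of the shard-and-concatenate workaround discussed above (this is not @hwijeen’s original snippet; the shard paths and shard count are hypothetical, and it assumes the corpus was already written as several smaller Arrow files):

```python
# Sketch only: memory-map several smaller Arrow shards and concatenate them
# instead of building one huge Arrow file. Paths and shard count are hypothetical.
from datasets import Dataset, concatenate_datasets

shard_paths = [f"data/shard_{i}/dataset.arrow" for i in range(10)]

# Dataset.from_file memory-maps each Arrow file without loading it into RAM.
shards = [Dataset.from_file(path) for path in shard_paths]

# concatenate_datasets keeps the underlying tables memory-mapped; no copy is made.
full_dataset = concatenate_datasets(shards)
print(len(full_dataset))
```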

4 reactions
lhoestq commented, May 26, 2021

Unfortunately no. Thanks for running the benchmark, though; it shows that your machine does a lot of read operations. This is not expected: on other machines it does almost no read operations, which enables very fast loading.

I did some tests on Google Colab and I have the same issue. The first time the dataset’s Arrow file is memory-mapped it always takes a lot of time (the time seems linear with respect to the dataset size). Reloading the dataset is then instantaneous, since the Arrow file has already been memory-mapped.

I also tried using the Arrow IPC file format (see #1933) instead of the streaming format that we currently use, but it didn’t help.

Memory mapping is handled by the OS and depends on the disk you’re using, so I’m not sure we can do much about it. I’ll continue to investigate anyway, because I still don’t know why in some cases it reads through the entire file (the high blocks-read count in your tests) and in other cases it does almost no reading.
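As an illustration of the behaviour described above, here is a hedged sketch that times two consecutive loads of the same on-disk dataset (the path is hypothetical). The second load is expected to be much faster because the Arrow file is already in the OS page cache:

```python
# Sketch only: reproduce "first memory-mapped load is slow, reload is fast".
# The dataset path is hypothetical.
import time

from datasets import load_from_disk

for attempt in (1, 2):
    start = time.perf_counter()
    ds = load_from_disk("path/to/big_dataset")  # memory-maps the Arrow file(s)
    _ = ds[0]                                   # touch one example
    print(f"load {attempt}: {time.perf_counter() - start:.2f}s")
```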

