Slow dataloading with big datasets issue persists
Hi,
I reported data fetching being too slow when the data is large (#2210) a couple of weeks ago, and @lhoestq referred me to the fix (#2122). However, the problem seems to persist. Here are the profiled results:
- Running with 60GB:

| Action | Mean duration (s) | Num calls | Total time (s) | Percentage % |
|-----------------|----------|-----|--------|---------|
| Total           | -        | -   | 517.96 | 100     |
| model_backward  | 0.26144  | 100 | 26.144 | 5.0475  |
| model_forward   | 0.11123  | 100 | 11.123 | 2.1474  |
| get_train_batch | 0.097121 | 100 | 9.7121 | 1.8751  |
- Running with 600GB, datasets==1.6.0:

| Action | Mean duration (s) | Num calls | Total time (s) | Percentage % |
|-----------------|----------|-----|--------|---------|
| Total           | -        | -   | 4563.2 | 100     |
| get_train_batch | 5.1279   | 100 | 512.79 | 11.237  |
| model_backward  | 4.8394   | 100 | 483.94 | 10.605  |
| model_forward   | 0.12162  | 100 | 12.162 | 0.26652 |
I see that get_train_batch lags when the data is large. Could this be caused by a different issue?
I would be happy to provide any information needed to investigate.
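One way to isolate the data-loading cost from the rest of the training loop is to time random access into the memory-mapped dataset directly. A minimal sketch is shown below; the data files are placeholders, not from the original report:

```python
import random
import time

from datasets import load_dataset

# Placeholder data files; substitute the actual large dataset.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Time random-access reads, which is roughly what get_train_batch does
# when the DataLoader pulls examples out of the memory-mapped Arrow file.
indices = random.sample(range(len(dataset)), 100)

start = time.perf_counter()
for i in indices:
    _ = dataset[i]
elapsed = time.perf_counter() - start
print(f"100 random reads: {elapsed:.2f}s total, {elapsed / 100:.4f}s per read")
```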
If this solution proves to help, we can add Arrow file sharding for all big datasets, integrated directly into load_dataset.
Yes, your intuition is right 😃
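For reference, a rough sketch of what such sharding could look like today, done manually with the existing Dataset.shard and save_to_disk APIs (the shard count, data files, and output paths are illustrative assumptions):

```python
from datasets import load_dataset

# Placeholder data files; substitute the actual large dataset.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Manually split the dataset into smaller contiguous Arrow shards,
# roughly what integrated sharding in load_dataset would automate.
num_shards = 8
for index in range(num_shards):
    shard = dataset.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.save_to_disk(f"shards/shard_{index:02d}")
```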
Unfortunately no. Thanks for running the benchmark though; it shows that your machine does a lot of read operations. This is not expected: on other machines it does almost no read operations, which enables very fast loading.
I did some tests on Google Colab and see the same issue. Memory-mapping the dataset's Arrow file for the first time always takes a lot of time (the time seems linear with respect to the dataset size). Reloading the dataset is then instantaneous, since the Arrow file has already been memory-mapped.
I also tried using the Arrow IPC file format (see #1933) instead of the streaming format we currently use, but it didn't help.
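A minimal way to see this first-load vs. reload behaviour is to time Dataset.from_file twice in a row; the Arrow file path below is a placeholder for the cached file:

```python
import time

from datasets import Dataset

# Placeholder path to an already-prepared Arrow file.
arrow_path = "/path/to/cache/dataset.arrow"

start = time.perf_counter()
ds = Dataset.from_file(arrow_path)  # memory-maps the Arrow file
print(f"first load: {time.perf_counter() - start:.2f}s")
del ds

start = time.perf_counter()
ds = Dataset.from_file(arrow_path)  # pages are now in the OS page cache
print(f"reload: {time.perf_counter() - start:.2f}s")
```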
Memory mapping is handled by the OS and depends on the disk you're using, so I'm not sure we can do much about it. I'll continue to investigate anyway, because I still don't know why in some cases it goes through the entire file (high Blocks read, as in your tests) and in other cases it does almost no reading.
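One way to compare machines on this point is to sample the OS disk counters around a pass over the dataset, for example with psutil. This is a sketch under the assumption that psutil is installed and that the placeholder dataset stands in for the real one:

```python
import time

import psutil
from datasets import load_dataset

# Placeholder data files; substitute the actual large dataset.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

io_before = psutil.disk_io_counters()
start = time.perf_counter()

# Touch a spread of rows across the whole file to trigger page faults.
step = max(1, len(dataset) // 1000)
for i in range(0, len(dataset), step):
    _ = dataset[i]

elapsed = time.perf_counter() - start
io_after = psutil.disk_io_counters()

print(f"elapsed: {elapsed:.2f}s")
print(f"read ops: {io_after.read_count - io_before.read_count}")
print(f"read MB: {(io_after.read_bytes - io_before.read_bytes) / 1e6:.1f}")
```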