
Slow dataloading with big datasets issue persists

See original GitHub issue

Hi,

I reported slow data fetching when the data is large (#2210) a couple of weeks ago, and @lhoestq referred me to the fix (#2122). However, the problem seems to persist. Here are the profiling results:

  1. Running with 60GB

| Action          | Mean duration (s) | Num calls | Total time (s) | Percentage (%) |
|-----------------|-------------------|-----------|----------------|----------------|
| Total           | -                 | -         | 517.96         | 100            |
| model_backward  | 0.26144           | 100       | 26.144         | 5.0475         |
| model_forward   | 0.11123           | 100       | 11.123         | 2.1474         |
| get_train_batch | 0.097121          | 100       | 9.7121         | 1.8751         |
  2. Running with 600GB, datasets==1.6.0

| Action          | Mean duration (s) | Num calls | Total time (s) | Percentage (%) |
|-----------------|-------------------|-----------|----------------|----------------|
| Total           | -                 | -         | 4563.2         | 100            |
| get_train_batch | 5.1279            | 100       | 512.79         | 11.237         |
| model_backward  | 4.8394            | 100       | 483.94         | 10.605         |
| model_forward   | 0.12162           | 100       | 12.162         | 0.26652        |

I see that get_train_batch lags when the data is large. Could this be related to a different issue? I would be happy to provide whatever information is needed to investigate.
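For reference, here is a minimal, hedged sketch of how a per-batch timing like get_train_batch can be reproduced outside a training framework. It assumes a PyTorch DataLoader wrapping a memory-mapped 🤗 dataset; the dataset path, batch size, and worker count are hypothetical.

```python
# Sketch only: time per-batch fetches from a DataLoader over a memory-mapped
# Hugging Face dataset, analogous to the get_train_batch row above.
# The dataset path, batch size, and worker count are hypothetical.
import time

from datasets import load_from_disk
from torch.utils.data import DataLoader

ds = load_from_disk("path/to/big_dataset")   # memory-maps the on-disk Arrow file(s)
ds = ds.with_format("torch")                 # return torch tensors when indexed

loader = DataLoader(ds, batch_size=32, num_workers=4)

prev = time.perf_counter()
for step, batch in enumerate(loader):
    now = time.perf_counter()
    print(f"batch {step}: {now - prev:.3f}s")
    if step == 99:                           # sample 100 batches, like the profile above
        break
    prev = time.perf_counter()
```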

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 51 (19 by maintainers)

Top GitHub Comments

4 reactions
lhoestq commented, Mar 17, 2022

> This has been a very interesting discussion to read. Are there any updates on it? I take it that the best option we have now is to shard our data into multiple datasets and concatenate them as shown above by @hwijeen.

If this solution proves helpful, we can add Arrow file sharding for all big datasets, directly integrated into load_dataset.

> I’m hoping that, by using the huggingface Dataset, the data loader will just index into the pyarrow table and the dataset won’t be loaded in full in each process (but presumably we have to pay the cost of load_data in each process so that the data loader can index into the table on that process)?

Yes your intuition is right 😃
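For readers following along, here is a minimal sketch of the shard-and-concatenate workaround discussed above (this is not @hwijeen’s original snippet; the shard paths and shard count are hypothetical, and it assumes the corpus was already written as several smaller Arrow files):

```python
# Sketch only: memory-map several smaller Arrow shards and concatenate them
# instead of building one huge Arrow file. Paths and shard count are hypothetical.
from datasets import Dataset, concatenate_datasets

shard_paths = [f"data/shard_{i}/dataset.arrow" for i in range(10)]

# Dataset.from_file memory-maps each Arrow file without loading it into RAM.
shards = [Dataset.from_file(path) for path in shard_paths]

# concatenate_datasets keeps the underlying tables memory-mapped; no copy is made.
full_dataset = concatenate_datasets(shards)
print(len(full_dataset))
```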

4 reactions
lhoestq commented, May 26, 2021

Unfortunately no. Thanks for running the benchmark, though; it shows that your machine does a lot of read operations. This is not expected: on other machines it does almost no read operations, which enables very fast loading.

I did some tests on Google Colab and I have the same issue. The first time the dataset’s Arrow file is memory-mapped it always takes a lot of time (the time seems linear with respect to the dataset size). Reloading the dataset is then instantaneous, since the Arrow file has already been memory-mapped.

I also tried using the Arrow IPC file format (see #1933) instead of the streaming format that we currently use, but it didn’t help.

Memory mapping is handled by the OS and depends on the disk you’re using, so I’m not sure we can do much about it. I’ll continue to investigate anyway, because I still don’t know why in some cases it reads through the entire file (the high blocks-read count in your tests) and in other cases it does almost no reading.
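As an illustration of the behaviour described above, here is a hedged sketch that times two consecutive loads of the same on-disk dataset (the path is hypothetical). The second load is expected to be much faster because the Arrow file is already in the OS page cache:

```python
# Sketch only: reproduce "first memory-mapped load is slow, reload is fast".
# The dataset path is hypothetical.
import time

from datasets import load_from_disk

for attempt in (1, 2):
    start = time.perf_counter()
    ds = load_from_disk("path/to/big_dataset")  # memory-maps the Arrow file(s)
    _ = ds[0]                                   # touch one example
    print(f"load {attempt}: {time.perf_counter() - start:.2f}s")
```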

