Datasets performance slow? - 6.4x slower than in-memory dataset
I’ve been very excited about this amazing datasets project. However, I’ve noticed that the performance can be substantially slower than using an in-memory dataset.
Now, I guess this is expected, since the data is memory-mapped from Arrow files and you don’t get anything for free. But I was surprised at how much slower it was.
For example, on the yelp_polarity dataset (560000 datapoints, or 17500 batches of 32), it took me 3:31 just to process the data and get it onto the GPU (no model involved), whereas the equivalent in-memory dataset finished in just 0:33.
Is this expected? Given that one of the goals of this project is also to accelerate dataset processing, this seems slower than I would expect. I understand the advantages of being able to work on datasets that exceed memory, and that’s very exciting to me, but I thought I’d open this issue to discuss.
For reference, I’m running an AMD Ryzen Threadripper 1900X 8-core CPU with 128 GB of RAM, a Samsung 960 EVO NVMe SSD, and an RTX Titan 24 GB GPU.
I can see with iotop that the dataset gets loaded into the system read buffers quickly, so it doesn’t incur any additional IO reads. In theory, then, all the data should already be in RAM, yet in my benchmark code below it’s still 6.4 times slower.
What am I doing wrong? And is there a way to force the dataset to load completely into memory, instead of being memory-mapped, in cases where you want maximum performance?
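For what it’s worth, the kind of thing I’m imagining is sketched below. I’m assuming here that keep_in_memory on load_dataset and map does what its name suggests; I haven’t verified this against the current datasets API, so treat it as a guess rather than a working answer.
from datasets import load_dataset

# Guess, not verified: recent datasets releases accept keep_in_memory=True on
# load_dataset and Dataset.map, which should hold the Arrow tables in RAM
# rather than memory-mapping the cache files on disk.
ds = load_dataset('yelp_polarity', keep_in_memory=True)
ds = ds.map(lambda batch: batch, batched=True, keep_in_memory=True)  # placeholder map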
At 3:31 for 17500 batches, that’s about 12 ms per batch. Does this 12 ms just become insignificant relative to the forward and backward passes in practice, and thus not worth worrying about?
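Rough arithmetic behind those per-batch numbers, using the wall times reported above:
batches = 560000 // 32              # 17500 batches of 32 examples
mapped_ms = 211 / batches * 1000    # 3:31 total -> ~12.1 ms per batch
in_memory_ms = 33 / batches * 1000  # 0:33 total -> ~1.9 ms per batch
print(f"{mapped_ms:.1f} ms vs {in_memory_ms:.1f} ms per batch "
      f"({mapped_ms / in_memory_ms:.1f}x slower)")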
In any case, here’s my code, benchmark.py. If you run it with an argument of memory, it will copy the data into memory before executing the same test.
import sys
from datasets import load_dataset
from transformers import DataCollatorWithPadding, BertTokenizerFast
from torch.utils.data import DataLoader
from tqdm import tqdm
if __name__ == '__main__':
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
    collate_fn = DataCollatorWithPadding(tokenizer, padding=True)
    ds = load_dataset('yelp_polarity')

    def do_tokenize(x):
        return tokenizer(x['text'], truncation=True)

    ds = ds.map(do_tokenize, batched=True)
    ds.set_format('torch', ['input_ids', 'token_type_ids', 'attention_mask'])

    if len(sys.argv) == 2 and sys.argv[1] == 'memory':
        # copy to memory - probably a faster way to do this - but demonstrates the point
        # approximately 530 batches per second - 17500 batches in 0:33
        print('using memory')
        _ds = [data for data in tqdm(ds['train'])]
    else:
        # approximately 83 batches per second - 17500 batches in 3:31
        print('using datasets')
        _ds = ds['train']

    dl = DataLoader(_ds, shuffle=True, collate_fn=collate_fn, batch_size=32, num_workers=4)

    for data in tqdm(dl):
        for k, v in data.items():
            data[k] = v.to('cuda')
For reference, my conda environment is here.
Once again, I’m very excited about this library and how easy it makes loading datasets without having to worry about system memory constraints.
Thanks for all your great work.
We should add it indeed, and maybe also a specific section with all the tips for maximal speed. What do you think @lhoestq @SBrandeis @yjernite?
My fault! I had not seen the dataloader_num_workers in TrainingArguments! Now I can parallelize and go fast! Sorry, and thanks.
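For anyone else who lands here, a minimal sketch of the change that fixed it for me (the other values are just illustrative; the relevant argument is dataloader_num_workers):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # example path, anything writable works
    per_device_train_batch_size=32,
    dataloader_num_workers=4,        # lets the Trainer's DataLoader collate batches in parallel workers
)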