
Datasets performance slow? - 6.4x slower than in memory dataset

See original GitHub issue

I’ve been very excited about this amazing datasets project. However, I’ve noticed that the performance can be substantially slower than using an in-memory dataset.

Now, I guess this is expected, since the data is memory-mapped from Arrow files and you don’t get anything for free. But I was surprised by just how much slower it is.

For example, on the yelp_polarity dataset (560000 data points, or 17500 batches of 32), it was taking me 3:31 just to process the data and get it onto the GPU (no model involved), whereas the equivalent in-memory dataset finished in just 0:33.

Is this expected? Given that one of the goals of this project is also to accelerate dataset processing, this seems a bit slower than I would expect. I understand the advantages of being able to work on datasets that exceed memory, and that’s very exciting to me, but I thought I’d open this issue to discuss it.

For reference, I’m running an AMD Ryzen Threadripper 1900X 8-core CPU with 128 GB of RAM and a Samsung 960 EVO NVMe SSD. The GPU is an RTX Titan with 24 GB of memory.

I can see with iotop that the dataset is quickly loaded into the system read buffers and after that incurs no additional IO reads. So in theory all the data should be in RAM, yet in my benchmark code below it’s still 6.4 times slower.

What am I doing wrong? And is there a way to force a dataset to load completely into memory instead of being memory-mapped, for cases where you want maximum performance?
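
As a side note for anyone reading this on a recent datasets release: load_dataset exposes a keep_in_memory flag (and there is a related datasets.config.IN_MEMORY_MAX_SIZE setting) for loading the Arrow data into RAM instead of memory-mapping it. A minimal sketch, assuming the flag is available in your installed version:

from datasets import load_dataset

# keep_in_memory=True asks datasets to copy the Arrow table into RAM up front
# instead of memory-mapping the cache files (availability depends on the
# installed datasets version)
ds = load_dataset('yelp_polarity', keep_in_memory=True)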

At 3:31 for 17500 batches, that’s about 12 ms per batch. Does this 12 ms just become insignificant as a proportion of the forward and backward passes, and thus not worth worrying about in practice?

In any case, here’s my benchmark code, benchmark.py. If you run it with the argument memory, it will copy the data into memory before executing the same test.

import sys
from datasets import load_dataset
from transformers import DataCollatorWithPadding, BertTokenizerFast
from torch.utils.data import DataLoader
from tqdm import tqdm

if __name__ == '__main__':
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
    collate_fn = DataCollatorWithPadding(tokenizer, padding=True)

    ds = load_dataset('yelp_polarity')

    def do_tokenize(x):
        return tokenizer(x['text'], truncation=True)

    ds = ds.map(do_tokenize, batched=True)
    ds.set_format('torch', ['input_ids', 'token_type_ids', 'attention_mask'])

    if len(sys.argv) == 2 and sys.argv[1] == 'memory':
        # copy to memory - probably a faster way to do this - but demonstrates the point
        # approximately 530 batches per second - 17500 batches in 0:33
        print('using memory')
        _ds = [data for data in tqdm(ds['train'])]
    else:
        # approximately 83 batches per second - 17500 batches in 3:31
        print('using datasets')
        _ds = ds['train']

    dl = DataLoader(_ds, shuffle=True, collate_fn=collate_fn, batch_size=32, num_workers=4)

    for data in tqdm(dl):
        for k, v in data.items():
            data[k] = v.to('cuda')
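
A side note on where the per-row cost goes: fetching rows one at a time from the Arrow-backed dataset pays a Python-level conversion cost per item, while reading contiguous slices amortizes it. A rough sketch of that access pattern, reusing ds from the script above (sequential access only, so it does not replicate the DataLoader’s shuffling):

# reading rows in contiguous slices amortizes the Arrow -> Python conversion
# cost compared with fetching one row at a time
train = ds['train']
batch_size = 32
for start in range(0, len(train), batch_size):
    batch = train[start:start + batch_size]  # dict: formatted column -> values for up to 32 rows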

For reference, my conda environment is here

Once again, I’m very excited about this library and how easy it makes loading datasets without worrying about system memory constraints.

Thanks for all your great work.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

2 reactions
thomwolf commented, Jan 21, 2021

We should add it indeed, and maybe also a specific section with all the tips for maximal speed. What do you think @lhoestq @SBrandeis @yjernite?

0 reactions
gaceladri commented, Feb 1, 2021

My fault! I had not seen the dataloader_num_workers in TrainingArguments! Now I can parallelize and go fast! Sorry, and thanks.
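
For context, this refers to the dataloader_num_workers argument of transformers.TrainingArguments, which controls how many subprocesses the Trainer’s PyTorch DataLoader uses. A minimal sketch (output_dir and the batch size are placeholder values):

from transformers import TrainingArguments

# dataloader_num_workers > 0 makes the Trainer's DataLoader fetch and collate
# batches in that many worker subprocesses, mirroring num_workers=4 in the
# benchmark script above
args = TrainingArguments(
    output_dir='output',             # placeholder output directory
    per_device_train_batch_size=32,  # placeholder; matches the benchmark batch size
    dataloader_num_workers=4,
)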


