Dataset.map gets stuck on _cast_to_python_objects

Describe the bug

Dataset.map, when given a Hugging Face tokenizer as its map function, can sometimes spend a huge amount of time on casts. A minimal example follows.

Not all usages suffer from this. For example, I profiled the preprocessor in the question answering notebook (https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb) and it did not have this problem. However, I'm at a loss to figure out how it avoids it, since the example below is minimal and still has the problem.

This casting, where it occurs, causes the Dataset.map to run approximately 7x slower than it runs for code which does not cause this casting.

This may be related to https://github.com/huggingface/datasets/issues/1046 . However, the tokenizer is not set to return Tensors.

Steps to reproduce the bug

A minimal, self-contained example to reproduce is below:

import transformers
from transformers import AutoTokenizer
from datasets import load_dataset
import torch
import cProfile

pretrained = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(pretrained)

squad = load_dataset('squad')
squad_train = squad['train']
squad_tiny = squad_train.select(range(5000))

assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

def tokenize(ds):
    tokens = tokenizer(
        text=ds['question'],
        text_pair=ds['context'],
        add_special_tokens=True,
        padding='max_length',
        truncation='only_second',
        max_length=160,
        stride=32,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
    return tokens

cmd = 'squad_tiny.map(tokenize, batched=True, remove_columns=squad_tiny.column_names)'
cProfile.run(cmd, sort='tottime')

Actual results

The code works, but takes 10-25 sec per batch (about 7x slower than non-casting code), with the following profile. Note that _cast_to_python_objects is the culprit.

    63524075 function calls (58206482 primitive calls) in 121.836 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
5274034/40   68.751    0.000  111.060    2.776 features.py:262(_cast_to_python_objects)
 42223832   24.077    0.000   33.310    0.000 {built-in method builtins.isinstance}
 16338/20    5.121    0.000  111.053    5.553 features.py:361(<listcomp>)
  5274135    4.747    0.000    4.749    0.000 {built-in method _abc._abc_instancecheck}
    80/40    4.731    0.059  116.292    2.907 {pyarrow.lib.array}
  5274135    4.485    0.000    9.234    0.000 abc.py:96(__instancecheck__)
2661564/2645196    2.959    0.000    4.298    0.000 features.py:1081(_check_non_null_non_empty_recursive)
        5    2.786    0.557    2.786    0.557 {method 'encode_batch' of 'tokenizers.Tokenizer' objects}
  2668052    0.930    0.000    0.930    0.000 {built-in method builtins.len}
     5000    0.930    0.000    0.938    0.000 tokenization_utils_fast.py:187(_convert_encoding)
        5    0.750    0.150    0.808    0.162 {method 'to_pydict' of 'pyarrow.lib.Table' objects}
        1    0.444    0.444  121.749  121.749 arrow_dataset.py:2501(_map_single)
       40    0.375    0.009  116.291    2.907 arrow_writer.py:151(__arrow_array__)
       10    0.066    0.007    0.066    0.007 {method 'write_batch' of 'pyarrow.lib._CRecordBatchWriter' objects}
        1    0.060    0.060  121.835  121.835 fingerprint.py:409(wrapper)
11387/5715    0.049    0.000    0.175    0.000 {built-in method builtins.getattr}
       36    0.049    0.001    0.049    0.001 {pyarrow._compute.call_function}
    15000    0.040    0.000    0.040    0.000 _collections_abc.py:719(__iter__)
        3    0.023    0.008    0.023    0.008 {built-in method _imp.create_dynamic}
       77    0.020    0.000    0.020    0.000 {built-in method builtins.dir}
       37    0.019    0.001    0.019    0.001 socket.py:543(send)
       15    0.017    0.001    0.017    0.001 tokenization_utils_fast.py:460(<listcomp>)
  432/421    0.015    0.000    0.024    0.000 traitlets.py:1388(_notify_observers)
     5000    0.015    0.000    0.018    0.000 _collections_abc.py:672(keys)
       51    0.014    0.000    0.042    0.001 traitlets.py:276(getmembers)
        5    0.014    0.003    3.775    0.755 tokenization_utils_fast.py:392(_batch_encode_plus)
      3/1    0.014    0.005    0.035    0.035 {built-in method _imp.exec_dynamic}
        5    0.012    0.002    0.950    0.190 tokenization_utils_fast.py:438(<listcomp>)
    31626    0.012    0.000    0.012    0.000 {method 'append' of 'list' objects}
1532/1001    0.011    0.000    0.189    0.000 traitlets.py:643(get)
        5    0.009    0.002    3.796    0.759 arrow_dataset.py:2631(apply_function_on_filtered_inputs)
       51    0.009    0.000    0.062    0.001 traitlets.py:1766(traits)
        5    0.008    0.002    3.784    0.757 tokenization_utils_base.py:2632(batch_encode_plus)
      368    0.007    0.000    0.044    0.000 traitlets.py:1715(_get_trait_default_generator)
       26    0.007    0.000    0.022    0.001 traitlets.py:1186(setup_instance)
       51    0.006    0.000    0.010    0.000 traitlets.py:1781(<listcomp>)
    80/32    0.006    0.000    0.052    0.002 table.py:1758(cast_array_to_feature)
      684    0.006    0.000    0.007    0.000 {method 'items' of 'dict' objects}
4344/1794    0.006    0.000    0.192    0.000 traitlets.py:675(__get__)
...

Environment info

I observed this on both Google colab and my local workstation:

Google colab

  • datasets version: 2.3.2
  • Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.13
  • PyArrow version: 6.0.1
  • Pandas version: 1.3.5

Local

  • datasets version: 2.3.2
  • Platform: Windows-7-6.1.7601-SP1
  • Python version: 3.8.10
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.3

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

szmoro commented, Sep 19, 2022 (1 reaction)

#take

mariosasko commented, Jul 18, 2022 (1 reaction)

Hi! Thanks for reporting and providing a reproducible example. Indeed, by default, datasets performs an expensive cast on the values returned by map to convert them to one of the types supported by PyArrow (the underlying storage format used by datasets). This cast is not needed on NumPy arrays as PyArrow supports them natively, so one way to make this transform faster is to add return_tensors="np" to the tokenizer call.

I think we should mention this in the docs (cc @stevhliu)
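
The suggestion above amounts to adding one keyword argument to the tokenizer call from the reproduction. The following is a minimal sketch of that change, not a benchmarked fix: TOKENIZER_KWARGS and make_tokenize are hypothetical names introduced here, and every argument other than return_tensors is copied from the original example. Whether and how much this speeds up a given run likely depends on the installed datasets and PyArrow versions.

```python
# Keyword arguments from the reproduction, plus the suggested return_tensors="np".
TOKENIZER_KWARGS = dict(
    add_special_tokens=True,
    padding="max_length",
    truncation="only_second",
    max_length=160,
    stride=32,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    return_tensors="np",  # ask the tokenizer for NumPy arrays instead of nested lists
)


def make_tokenize(tokenizer):
    """Build a batched map function around a tokenizer-like callable.

    `tokenizer` is expected to accept `text`, `text_pair`, and the keyword
    arguments above, as the fast tokenizer in the reproduction does.
    """

    def tokenize(batch):
        return tokenizer(
            text=batch["question"],
            text_pair=batch["context"],
            **TOKENIZER_KWARGS,
        )

    return tokenize
```

In the reproduction, this would be used as squad_tiny.map(make_tokenize(tokenizer), batched=True, remove_columns=squad_tiny.column_names). Because padding='max_length' gives every sequence the same length, the tokenizer output should form rectangular arrays.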
