Dataset.map gets stuck on _cast_to_python_objects
Describe the bug
`Dataset.map`, when fed a Hugging Face tokenizer as its map function, can sometimes spend huge amounts of time doing casts. A minimal example follows.
Not all usages suffer from this. For example, I profiled the preprocessor at https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb, and it did not have this problem. However, I'm at a loss to figure out how it avoids it, as the example below is simple and minimal and still has the problem.
Where it occurs, this casting causes `Dataset.map` to run approximately 7x slower than code which does not trigger it.
This may be related to https://github.com/huggingface/datasets/issues/1046. However, the tokenizer is not set to return tensors.
Steps to reproduce the bug
A minimal, self-contained example to reproduce is below:
```python
import transformers
from transformers import AutoTokenizer
from datasets import load_dataset
import torch
import cProfile

pretrained = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(pretrained)
squad = load_dataset('squad')
squad_train = squad['train']
squad_tiny = squad_train.select(range(5000))
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

def tokenize(ds):
    tokens = tokenizer(text=ds['question'],
                       text_pair=ds['context'],
                       add_special_tokens=True,
                       padding='max_length',
                       truncation='only_second',
                       max_length=160,
                       stride=32,
                       return_overflowing_tokens=True,
                       return_offsets_mapping=True,
                       )
    return tokens

cmd = 'squad_tiny.map(tokenize, batched=True, remove_columns=squad_tiny.column_names)'
cProfile.run(cmd, sort='tottime')
```
Actual results
The code works, but takes 10-25 seconds per batch (about 7x slower than non-casting code), with the following profile. Note that `_cast_to_python_objects` is the culprit.
```
63524075 function calls (58206482 primitive calls) in 121.836 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
5274034/40 68.751 0.000 111.060 2.776 features.py:262(_cast_to_python_objects)
42223832 24.077 0.000 33.310 0.000 {built-in method builtins.isinstance}
16338/20 5.121 0.000 111.053 5.553 features.py:361(<listcomp>)
5274135 4.747 0.000 4.749 0.000 {built-in method _abc._abc_instancecheck}
80/40 4.731 0.059 116.292 2.907 {pyarrow.lib.array}
5274135 4.485 0.000 9.234 0.000 abc.py:96(__instancecheck__)
2661564/2645196 2.959 0.000 4.298 0.000 features.py:1081(_check_non_null_non_empty_recursive)
5 2.786 0.557 2.786 0.557 {method 'encode_batch' of 'tokenizers.Tokenizer' objects}
2668052 0.930 0.000 0.930 0.000 {built-in method builtins.len}
5000 0.930 0.000 0.938 0.000 tokenization_utils_fast.py:187(_convert_encoding)
5 0.750 0.150 0.808 0.162 {method 'to_pydict' of 'pyarrow.lib.Table' objects}
1 0.444 0.444 121.749 121.749 arrow_dataset.py:2501(_map_single)
40 0.375 0.009 116.291 2.907 arrow_writer.py:151(__arrow_array__)
10 0.066 0.007 0.066 0.007 {method 'write_batch' of 'pyarrow.lib._CRecordBatchWriter' objects}
1 0.060 0.060 121.835 121.835 fingerprint.py:409(wrapper)
11387/5715 0.049 0.000 0.175 0.000 {built-in method builtins.getattr}
36 0.049 0.001 0.049 0.001 {pyarrow._compute.call_function}
15000 0.040 0.000 0.040 0.000 _collections_abc.py:719(__iter__)
3 0.023 0.008 0.023 0.008 {built-in method _imp.create_dynamic}
77 0.020 0.000 0.020 0.000 {built-in method builtins.dir}
37 0.019 0.001 0.019 0.001 socket.py:543(send)
15 0.017 0.001 0.017 0.001 tokenization_utils_fast.py:460(<listcomp>)
432/421 0.015 0.000 0.024 0.000 traitlets.py:1388(_notify_observers)
5000 0.015 0.000 0.018 0.000 _collections_abc.py:672(keys)
51 0.014 0.000 0.042 0.001 traitlets.py:276(getmembers)
5 0.014 0.003 3.775 0.755 tokenization_utils_fast.py:392(_batch_encode_plus)
3/1 0.014 0.005 0.035 0.035 {built-in method _imp.exec_dynamic}
5 0.012 0.002 0.950 0.190 tokenization_utils_fast.py:438(<listcomp>)
31626 0.012 0.000 0.012 0.000 {method 'append' of 'list' objects}
1532/1001 0.011 0.000 0.189 0.000 traitlets.py:643(get)
5 0.009 0.002 3.796 0.759 arrow_dataset.py:2631(apply_function_on_filtered_inputs)
51 0.009 0.000 0.062 0.001 traitlets.py:1766(traits)
5 0.008 0.002 3.784 0.757 tokenization_utils_base.py:2632(batch_encode_plus)
368 0.007 0.000 0.044 0.000 traitlets.py:1715(_get_trait_default_generator)
26 0.007 0.000 0.022 0.001 traitlets.py:1186(setup_instance)
51 0.006 0.000 0.010 0.000 traitlets.py:1781(<listcomp>)
80/32 0.006 0.000 0.052 0.002 table.py:1758(cast_array_to_feature)
684 0.006 0.000 0.007 0.000 {method 'items' of 'dict' objects}
4344/1794 0.006 0.000 0.192 0.000 traitlets.py:675(__get__)
...
```
Environment info
I observed this on both Google Colab and my local workstation:
Google Colab
- `datasets` version: 2.3.2
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- PyArrow version: 6.0.1
- Pandas version: 1.3.5
Local
- `datasets` version: 2.3.2
- Platform: Windows-7-6.1.7601-SP1
- Python version: 3.8.10
- PyArrow version: 8.0.0
- Pandas version: 1.4.3
Top GitHub Comments
#take
Hi! Thanks for reporting and providing a reproducible example. Indeed, by default, `datasets` performs an expensive cast on the values returned by `map` to convert them to one of the types supported by PyArrow (the underlying storage format used by `datasets`). This cast is not needed on NumPy arrays, as PyArrow supports them natively, so one way to make this transform faster is to add `return_tensors="np"` to the tokenizer call. I think we should mention this in the docs (cc @stevhliu)
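For reference, a minimal sketch of the suggested change applied to the reproduction above — the only addition is the `return_tensors='np'` argument; everything else matches the original `tokenize` function:

```python
def tokenize(ds):
    # Returning NumPy arrays lets PyArrow ingest the values directly,
    # which should avoid the per-element _cast_to_python_objects pass.
    return tokenizer(text=ds['question'],
                     text_pair=ds['context'],
                     add_special_tokens=True,
                     padding='max_length',
                     truncation='only_second',
                     max_length=160,
                     stride=32,
                     return_overflowing_tokens=True,
                     return_offsets_mapping=True,
                     return_tensors='np',
                     )

squad_tiny.map(tokenize, batched=True, remove_columns=squad_tiny.column_names)
```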