Pipelines do not control input sequences longer than those accepted by the model
🐛 Bug
Information
Model I am using (Bert, XLNet …): DistilBERT
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
- Create a “sentiment-analysis” pipeline with a DistilBERT tokenizer and model
- Prepare a string that will produce more than 512 tokens upon tokenization
- Run the pipeline over such input string
from transformers import pipeline
pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model='distilbert-base-uncased')
very_long_text = "This is a very long text" * 100
pipe(very_long_text)
Expected behavior
The pipeline should ensure in some way that the input string does not overflow the maximum number of tokens the model can accept, for instance by limiting the number of tokens generated in the tokenization step. The user can't control this beforehand, as the tokenizer is run by the pipeline itself and it can be hard to predict how many tokens a given text will be broken into.
One possible way of addressing this might be to include optional parameters in the pipeline constructor that are forwarded to the tokenizer.
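The idea of forwarding constructor options to the tokenizer can be sketched in plain Python as follows. This is only an illustration of the design, not the transformers API: ToyPipeline, whitespace_tokenizer and the tokenizer_kwargs parameter are hypothetical names, and the toy tokenizer simply splits on whitespace (so the test string uses a trailing space to keep the repeated copies apart).

```python
from typing import Any, Callable, Dict, List, Optional


def whitespace_tokenizer(text: str, max_length: Optional[int] = None,
                         truncation: bool = False) -> List[str]:
    """Toy stand-in for a real tokenizer: splits on whitespace."""
    tokens = text.split()
    if truncation and max_length is not None:
        tokens = tokens[:max_length]
    return tokens


class ToyPipeline:
    """Hypothetical pipeline that forwards constructor options to its tokenizer."""

    def __init__(self, tokenizer: Callable[..., List[str]],
                 model_max_length: int,
                 tokenizer_kwargs: Optional[Dict[str, Any]] = None) -> None:
        self.tokenizer = tokenizer
        self.model_max_length = model_max_length
        self.tokenizer_kwargs = dict(tokenizer_kwargs or {})

    def __call__(self, text: str) -> int:
        tokens = self.tokenizer(text, **self.tokenizer_kwargs)
        if len(tokens) > self.model_max_length:
            # Fail early with a readable message instead of crashing in the model.
            raise ValueError(
                f"Input has {len(tokens)} tokens, but the model accepts at most "
                f"{self.model_max_length}."
            )
        # A real pipeline would run the model here; the sketch returns the token count.
        return len(tokens)


# 6 words per copy, 100 copies: 600 whitespace-separated tokens.
very_long_text = "This is a very long text " * 100
pipe = ToyPipeline(whitespace_tokenizer, model_max_length=512,
                   tokenizer_kwargs={"truncation": True, "max_length": 512})
n_tokens = pipe(very_long_text)
```

With the options forwarded, the input is cut to 512 tokens before it reaches the model; without them, the same call raises a ValueError that names the limit rather than an index error deep inside the embedding lookup.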
The current error trace is:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-1-ef48faf7ffbb> in <module>
3 pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model='distilbert-base-uncased')
4 very_long_text = "This is a very long text" * 100
----> 5 pipe(very_long_text)
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
714
715 def __call__(self, *args, **kwargs):
--> 716 outputs = super().__call__(*args, **kwargs)
717 scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
718 return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
469 def __call__(self, *args, **kwargs):
470 inputs = self._parse_and_tokenize(*args, **kwargs)
--> 471 return self._forward(inputs)
472
473 def _forward(self, inputs, return_tensors=False):
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/pipelines.py in _forward(self, inputs, return_tensors)
488 with torch.no_grad():
489 inputs = self.ensure_tensor_on_device(**inputs)
--> 490 predictions = self.model(**inputs)[0].cpu()
491
492 if return_tensors:
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds, labels)
609 """
610 distilbert_output = self.distilbert(
--> 611 input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds
612 )
613 hidden_state = distilbert_output[0] # (bs, seq_len, dim)
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds)
464
465 if inputs_embeds is None:
--> 466 inputs_embeds = self.embeddings(input_ids) # (bs, seq_length, dim)
467 tfmr_output = self.transformer(x=inputs_embeds, attn_mask=attention_mask, head_mask=head_mask)
468 hidden_state = tfmr_output[0]
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids)
89
90 word_embeddings = self.word_embeddings(input_ids) # (bs, max_seq_length, dim)
---> 91 position_embeddings = self.position_embeddings(position_ids) # (bs, max_seq_length, dim)
92
93 embeddings = word_embeddings + position_embeddings # (bs, max_seq_length, dim)
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
112 return F.embedding(
113 input, self.weight, self.padding_idx, self.max_norm,
--> 114 self.norm_type, self.scale_grad_by_freq, self.sparse)
115
116 def extra_repr(self):
~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1482 # remove once script supports set_grad_enabled
1483 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1484 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1485
1486
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /tmp/pip-req-build-808afw3c/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
Environment info
# Name Version Build Channel
_libgcc_mutex 0.1 main
_pytorch_select 0.2 gpu_0
_tflow_select 2.1.0 gpu
absl-py 0.9.0 py36_0
asn1crypto 1.3.0 py36_0
astor 0.8.0 py36_0
attrs 19.3.0 py_0
backcall 0.1.0 py36_0
blas 1.0 mkl
bleach 3.1.4 py_0
boto3 1.12.47 pypi_0 pypi
botocore 1.15.47 pypi_0 pypi
c-ares 1.15.0 h7b6447c_1001
ca-certificates 2020.1.1 0
certifi 2020.4.5.1 py36_0
cffi 1.14.0 py36h2e261b9_0
chardet 3.0.4 py36_1003
click 7.1.2 pypi_0 pypi
cloudpickle 1.3.0 py_0
cryptography 2.8 py36h1ba5d50_0
cudatoolkit 10.1.243 h6bb024c_0
cudnn 7.6.5 cuda10.1_0
cupti 10.1.168 0
cycler 0.10.0 py36_0
cytoolz 0.10.1 py36h7b6447c_0
dask-core 2.15.0 py_0
dataclasses 0.7 pypi_0 pypi
dbus 1.13.12 h746ee38_0
decorator 4.4.2 py_0
defusedxml 0.6.0 py_0
docutils 0.15.2 pypi_0 pypi
eli5 0.10.1 pypi_0 pypi
entrypoints 0.3 py36_0
expat 2.2.6 he6710b0_0
filelock 3.0.12 pypi_0 pypi
fontconfig 2.13.0 h9420a91_0
freetype 2.9.1 h8a8886c_1
gast 0.3.3 py_0
glib 2.63.1 h5a9c865_0
gmp 6.1.2 h6c8ec71_1
google-pasta 0.2.0 py_0
grpcio 1.27.2 py36hf8bcb03_0
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb453b48_1
h5py 2.10.0 py36h7918eee_0
hdf5 1.10.4 hb1b8bf9_0
icu 58.2 h9c2bf20_1
idna 2.8 py36_0
imageio 2.8.0 py_0
importlib_metadata 1.5.0 py36_0
intel-openmp 2020.0 166
ipykernel 5.1.4 py36h39e3cac_0
ipython 7.13.0 py36h5ca1d4c_0
ipython_genutils 0.2.0 py36_0
ipywidgets 7.5.1 py_0
jedi 0.16.0 py36_1
jinja2 2.11.1 py_0
jmespath 0.9.5 pypi_0 pypi
joblib 0.14.1 py_0
jpeg 9b h024ee3a_2
json5 0.9.4 pypi_0 pypi
jsonschema 3.2.0 py36_0
jupyter 1.0.0 py36_7
jupyter_client 6.1.2 py_0
jupyter_console 6.1.0 py_0
jupyter_core 4.6.3 py36_0
jupyterlab 2.1.2 pypi_0 pypi
jupyterlab-server 1.1.4 pypi_0 pypi
keras-applications 1.0.8 py_0
keras-base 2.3.1 py36_0
keras-gpu 2.3.1 0
keras-preprocessing 1.1.0 py_1
kiwisolver 1.1.0 py36he6710b0_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libprotobuf 3.11.4 hd408876_0
libsodium 1.0.16 h1bed415_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_0
libuuid 1.0.3 h1bed415_2
libxcb 1.13 h1bed415_1
libxml2 2.9.9 hea5a465_1
markdown 3.1.1 py36_0
markupsafe 1.1.1 py36h7b6447c_0
matplotlib 2.2.2 py36hb69df0a_2
mistune 0.8.4 py36h7b6447c_0
mkl 2020.0 166
mkl-service 2.3.0 py36he904b0f_0
mkl_fft 1.0.15 py36ha843d7b_0
mkl_random 1.1.0 py36hd6b4f25_0
nb_conda 2.2.1 py36_0
nb_conda_kernels 2.2.3 py36_0
nbconvert 5.6.1 py36_0
nbformat 5.0.4 py_0
ncurses 6.2 he6710b0_0
networkx 2.4 py_0
ninja 1.9.0 py36hfd86e86_0
notebook 6.0.3 py36_0
numpy 1.18.1 py36h4f9e942_0
numpy-base 1.18.1 py36hde5b4d6_1
olefile 0.46 py36_0
openssl 1.1.1g h7b6447c_0
packaging 20.3 py_0
pandas 0.23.0 py36h637b7d7_0
pandoc 2.2.3.2 0
pandocfilters 1.4.2 py36_1
parso 0.6.2 py_0
pcre 8.43 he6710b0_0
pexpect 4.8.0 py36_0
pickleshare 0.7.5 py36_0
pillow 7.0.0 py36hb39fc2d_0
pip 19.3.1 py36_0
prometheus_client 0.7.1 py_0
prompt-toolkit 3.0.4 py_0
prompt_toolkit 3.0.4 0
protobuf 3.11.4 py36he6710b0_0
ptyprocess 0.6.0 py36_0
pycparser 2.20 py_0
pygments 2.6.1 py_0
pyopenssl 19.1.0 py36_0
pyparsing 2.4.6 py_0
pyqt 5.9.2 py36h05f1152_2
pyrsistent 0.16.0 py36h7b6447c_0
pysocks 1.7.1 py36_0
python 3.6.10 hcf32534_1
python-dateutil 2.8.1 py_0
python-graphviz 0.14 pypi_0 pypi
pytorch 1.4.0 cuda101py36h02f0884_0
pytz 2019.3 py_0
pywavelets 1.1.1 py36h7b6447c_0
pyyaml 5.3.1 py36h7b6447c_0
pyzmq 18.1.1 py36he6710b0_0
qt 5.9.7 h5867ecd_1
qtconsole 4.7.3 py_0
qtpy 1.9.0 py_0
readline 8.0 h7b6447c_0
regex 2020.4.4 pypi_0 pypi
requests 2.22.0 py36_1
s3transfer 0.3.3 pypi_0 pypi
sacremoses 0.0.41 pypi_0 pypi
scikit-image 0.14.2 py36he6710b0_0
scikit-learn 0.22.1 py36hd81dba3_0
scikit-optimize 0.5.2 pypi_0 pypi
scipy 1.4.1 py36h0b6359f_0
send2trash 1.5.0 py36_0
sentencepiece 0.1.86 pypi_0 pypi
setuptools 46.1.3 py36_0
sip 4.19.8 py36hf484d3e_0
six 1.14.0 py36_0
sqlite 3.31.1 h62c20be_1
tabulate 0.8.7 pypi_0 pypi
tensorboard 1.14.0 py36hf484d3e_0
tensorflow 1.14.0 gpu_py36h3fb9ad6_0
tensorflow-base 1.14.0 gpu_py36he45bfe2_0
tensorflow-estimator 1.14.0 py_0
tensorflow-gpu 1.14.0 h0d30ee6_0
termcolor 1.1.0 py36_1
terminado 0.8.3 py36_0
testpath 0.4.4 py_0
tk 8.6.8 hbc83047_0
tokenizers 0.7.0 pypi_0 pypi
toolz 0.10.0 py_0
torchvision 0.5.0 py36_cu101 pytorch
tornado 6.0.4 py36h7b6447c_1
tqdm 4.45.0 pypi_0 pypi
traitlets 4.3.3 py36_0
transformers 2.9.1 pypi_0 pypi
urllib3 1.25.8 py36_0
wcwidth 0.1.9 py_0
webencodings 0.5.1 py36_1
werkzeug 1.0.1 py_0
wheel 0.34.2 py36_0
widgetsnbextension 3.5.1 py36_0
wrapt 1.12.1 py36h7b6447c_1
xz 5.2.5 h7b6447c_0
yaml 0.1.7 had09818_2
zeromq 4.3.1 he6710b0_3
zipp 2.2.0 py_0
zlib 1.2.11 h7b6447c_3
zstd 1.3.7 h0b5b093_0
- Platform: Linux matrix 4.4.0-174-generic #204-Ubuntu SMP Wed Jan 29 06:41:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Python version: Python 3.6.10 :: Anaconda, Inc.
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Issue Analytics
- State:
- Created 3 years ago
- Reactions:2
- Comments:11 (3 by maintainers)
Top GitHub Comments
I think the problem is the following. Here: https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L463 the input is encoded and has a length of 701, which is larger than self.tokenizer.model_max_length, so the forward pass of the model crashes.
A simple fix would be to add a statement like:
no arguments for the batch_encode_plus function can be inserted because of two reasons:
- TextClassificationPipeline cannot accept a mixture of kwargs and args, see https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L141
- The batch_encode_plus function actually does not accept any **kwargs arguments currently, see https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L464
IMO, it would be a good idea to do a larger refactoring here where we allow the pipelines to be more flexible so that batch_encode_plus **kwargs can easily be inserted. @LysandreJik

It is not working for me either. Code to reproduce error is below.
Error message is:
_sanitize_parameters() got an unexpected keyword argument 'truncation'
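The first comment above proposes cutting the encoded input down to self.tokenizer.model_max_length, but the statement it suggested is not preserved here. The following is a minimal sketch of that idea under stated assumptions, not the commenter's code: truncate_input_ids is a hypothetical helper, the inputs are plain lists of token ids, the last id of each sequence is assumed to be a closing special token worth keeping, and the ids 101, 2023 and 102 are illustrative.

```python
import warnings
from typing import List


def truncate_input_ids(input_ids: List[List[int]],
                       model_max_length: int) -> List[List[int]]:
    """Return copies of the sequences, each cut to at most model_max_length ids."""
    truncated = []
    for ids in input_ids:
        if len(ids) > model_max_length:
            warnings.warn(
                f"Input of length {len(ids)} was truncated to {model_max_length} tokens."
            )
            # Keep the final id (assumed to be a closing special token)
            # and drop the overflow just before it.
            ids = ids[: model_max_length - 1] + ids[-1:]
        truncated.append(list(ids))
    return truncated


# One sequence of 701 ids (the length reported in the comment) and one short sequence.
batch = [[101] + [2023] * 699 + [102], [101, 2023, 102]]
safe_batch = truncate_input_ids(batch, model_max_length=512)
```

Truncating silently changes what the model sees, which is why the sketch emits a warning; whether to truncate, warn, or raise a clear error is the design question the thread leaves open.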