Pipelines do not control input sequences longer than those accepted by the model


🐛 Bug

Information

Model I am using (Bert, XLNet …): DistilBERT

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

  1. Create a “sentiment-analysis” pipeline with a DistilBERT tokenizer and model
  2. Prepare a string that will produce more than 512 tokens upon tokenization
  3. Run the pipeline over such input string
from transformers import pipeline

pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model='distilbert-base-uncased')
very_long_text = "This is a very long text" * 100
pipe(very_long_text)

Expected behavior

The pipeline should ensure in some way that the input string does not exceed the maximum number of tokens the model can accept, for instance by limiting the number of tokens generated in the tokenization step. The user can’t control this beforehand, as the tokenizer is run by the pipeline itself, and it can be hard to predict how many tokens a given text will be broken down into.

One possible way of addressing this might be to include optional parameters in the pipeline constructor that are forwarded to the tokenizer.
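One way to picture this proposal is a pipeline whose constructor stores tokenizer options and forwards them on every call. The sketch below is a toy, not the transformers implementation: `ToyTokenizer`, `ToyPipeline`, and the `tokenizer_kwargs` parameter are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
from typing import Dict, List, Optional


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: splits on whitespace."""

    def encode(self, text: str, max_length: Optional[int] = None) -> List[str]:
        tokens = text.split()
        # Truncate only when the caller asked for a limit.
        if max_length is not None:
            tokens = tokens[:max_length]
        return tokens


class ToyPipeline:
    """Hypothetical pipeline that forwards constructor options to its tokenizer."""

    def __init__(self, tokenizer: ToyTokenizer, tokenizer_kwargs: Optional[Dict] = None):
        self.tokenizer = tokenizer
        self.tokenizer_kwargs = dict(tokenizer_kwargs or {})

    def __call__(self, text: str) -> int:
        tokens = self.tokenizer.encode(text, **self.tokenizer_kwargs)
        # A real pipeline would run the model here; the toy just reports the length.
        return len(tokens)


very_long_text = "This is a very long text " * 100  # 600 whitespace-separated tokens
unbounded = ToyPipeline(ToyTokenizer())
bounded = ToyPipeline(ToyTokenizer(), tokenizer_kwargs={"max_length": 512})
print(unbounded(very_long_text), bounded(very_long_text))
```

With this shape, the caller decides the limit once at construction time and never has to touch the tokenizer directly.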

The current error trace is:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-ef48faf7ffbb> in <module>
      3 pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model='distilbert-base-uncased')
      4 very_long_text = "This is a very long text" * 100
----> 5 pipe(very_long_text)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    714 
    715     def __call__(self, *args, **kwargs):
--> 716         outputs = super().__call__(*args, **kwargs)
    717         scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
    718         return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    469     def __call__(self, *args, **kwargs):
    470         inputs = self._parse_and_tokenize(*args, **kwargs)
--> 471         return self._forward(inputs)
    472 
    473     def _forward(self, inputs, return_tensors=False):

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/pipelines.py in _forward(self, inputs, return_tensors)
    488                 with torch.no_grad():
    489                     inputs = self.ensure_tensor_on_device(**inputs)
--> 490                     predictions = self.model(**inputs)[0].cpu()
    491 
    492         if return_tensors:

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds, labels)
    609         """
    610         distilbert_output = self.distilbert(
--> 611             input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds
    612         )
    613         hidden_state = distilbert_output[0]  # (bs, seq_len, dim)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds)
    464 
    465         if inputs_embeds is None:
--> 466             inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)
    467         tfmr_output = self.transformer(x=inputs_embeds, attn_mask=attention_mask, head_mask=head_mask)
    468         hidden_state = tfmr_output[0]

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids)
     89 
     90         word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, dim)
---> 91         position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)
     92 
     93         embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
--> 114             self.norm_type, self.scale_grad_by_freq, self.sparse)
    115 
    116     def extra_repr(self):

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1482         # remove once script supports set_grad_enabled
   1483         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1484     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1485 
   1486 

RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /tmp/pip-req-build-808afw3c/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
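
The last frames fail inside the position-embedding lookup. As a rough illustration using plain lists rather than the PyTorch code path, and assuming the 512-position limit of distilbert-base-uncased, a table with 512 rows only has valid indices 0 to 511, so any longer sequence first fails at position 512. The 701-token length is the one reported in the comments on this issue; `first_bad_position` is a hypothetical helper.

```python
MAX_POSITIONS = 512  # assumed position limit for distilbert-base-uncased
SEQ_LEN = 701        # encoded length reported in the comments on this issue

# Toy position-embedding table: one (dummy) row per supported position.
position_table = [[0.0] for _ in range(MAX_POSITIONS)]


def first_bad_position(seq_len: int, table: list) -> int:
    """Return the first position id with no row in the table, or -1 if all fit."""
    for position_id in range(seq_len):
        try:
            table[position_id]
        except IndexError:
            return position_id
    return -1


print(first_bad_position(SEQ_LEN, position_table))        # over-long input
print(first_bad_position(MAX_POSITIONS, position_table))  # input that just fits
```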

Environment info

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_pytorch_select           0.2                       gpu_0
_tflow_select             2.1.0                       gpu
absl-py                   0.9.0                    py36_0
asn1crypto                1.3.0                    py36_0
astor                     0.8.0                    py36_0
attrs                     19.3.0                     py_0
backcall                  0.1.0                    py36_0
blas                      1.0                         mkl
bleach                    3.1.4                      py_0
boto3                     1.12.47                  pypi_0    pypi
botocore                  1.15.47                  pypi_0    pypi
c-ares                    1.15.0            h7b6447c_1001
ca-certificates           2020.1.1                      0
certifi                   2020.4.5.1               py36_0
cffi                      1.14.0           py36h2e261b9_0
chardet                   3.0.4                 py36_1003
click                     7.1.2                    pypi_0    pypi
cloudpickle               1.3.0                      py_0
cryptography              2.8              py36h1ba5d50_0
cudatoolkit               10.1.243             h6bb024c_0
cudnn                     7.6.5                cuda10.1_0
cupti                     10.1.168                      0
cycler                    0.10.0                   py36_0
cytoolz                   0.10.1           py36h7b6447c_0
dask-core                 2.15.0                     py_0
dataclasses               0.7                      pypi_0    pypi
dbus                      1.13.12              h746ee38_0
decorator                 4.4.2                      py_0
defusedxml                0.6.0                      py_0
docutils                  0.15.2                   pypi_0    pypi
eli5                      0.10.1                   pypi_0    pypi
entrypoints               0.3                      py36_0
expat                     2.2.6                he6710b0_0
filelock                  3.0.12                   pypi_0    pypi
fontconfig                2.13.0               h9420a91_0
freetype                  2.9.1                h8a8886c_1
gast                      0.3.3                      py_0
glib                      2.63.1               h5a9c865_0
gmp                       6.1.2                h6c8ec71_1
google-pasta              0.2.0                      py_0
grpcio                    1.27.2           py36hf8bcb03_0
gst-plugins-base          1.14.0               hbbd80ab_1
gstreamer                 1.14.0               hb453b48_1
h5py                      2.10.0           py36h7918eee_0
hdf5                      1.10.4               hb1b8bf9_0
icu                       58.2                 h9c2bf20_1
idna                      2.8                      py36_0
imageio                   2.8.0                      py_0
importlib_metadata        1.5.0                    py36_0
intel-openmp              2020.0                      166
ipykernel                 5.1.4            py36h39e3cac_0
ipython                   7.13.0           py36h5ca1d4c_0
ipython_genutils          0.2.0                    py36_0
ipywidgets                7.5.1                      py_0
jedi                      0.16.0                   py36_1
jinja2                    2.11.1                     py_0
jmespath                  0.9.5                    pypi_0    pypi
joblib                    0.14.1                     py_0
jpeg                      9b                   h024ee3a_2
json5                     0.9.4                    pypi_0    pypi
jsonschema                3.2.0                    py36_0
jupyter                   1.0.0                    py36_7
jupyter_client            6.1.2                      py_0
jupyter_console           6.1.0                      py_0
jupyter_core              4.6.3                    py36_0
jupyterlab                2.1.2                    pypi_0    pypi
jupyterlab-server         1.1.4                    pypi_0    pypi
keras-applications        1.0.8                      py_0
keras-base                2.3.1                    py36_0
keras-gpu                 2.3.1                         0
keras-preprocessing       1.1.0                      py_1
kiwisolver                1.1.0            py36he6710b0_0
ld_impl_linux-64          2.33.1               h53a641e_7
libedit                   3.1.20181209         hc058e9b_0
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_0
libpng                    1.6.37               hbc83047_0
libprotobuf               3.11.4               hd408876_0
libsodium                 1.0.16               h1bed415_0
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.1.0                h2733197_0
libuuid                   1.0.3                h1bed415_2
libxcb                    1.13                 h1bed415_1
libxml2                   2.9.9                hea5a465_1
markdown                  3.1.1                    py36_0
markupsafe                1.1.1            py36h7b6447c_0
matplotlib                2.2.2            py36hb69df0a_2
mistune                   0.8.4            py36h7b6447c_0
mkl                       2020.0                      166
mkl-service               2.3.0            py36he904b0f_0
mkl_fft                   1.0.15           py36ha843d7b_0
mkl_random                1.1.0            py36hd6b4f25_0
nb_conda                  2.2.1                    py36_0
nb_conda_kernels          2.2.3                    py36_0
nbconvert                 5.6.1                    py36_0
nbformat                  5.0.4                      py_0
ncurses                   6.2                  he6710b0_0
networkx                  2.4                        py_0
ninja                     1.9.0            py36hfd86e86_0
notebook                  6.0.3                    py36_0
numpy                     1.18.1           py36h4f9e942_0
numpy-base                1.18.1           py36hde5b4d6_1
olefile                   0.46                     py36_0
openssl                   1.1.1g               h7b6447c_0
packaging                 20.3                       py_0
pandas                    0.23.0           py36h637b7d7_0
pandoc                    2.2.3.2                       0
pandocfilters             1.4.2                    py36_1
parso                     0.6.2                      py_0
pcre                      8.43                 he6710b0_0
pexpect                   4.8.0                    py36_0
pickleshare               0.7.5                    py36_0
pillow                    7.0.0            py36hb39fc2d_0
pip                       19.3.1                   py36_0
prometheus_client         0.7.1                      py_0
prompt-toolkit            3.0.4                      py_0
prompt_toolkit            3.0.4                         0
protobuf                  3.11.4           py36he6710b0_0
ptyprocess                0.6.0                    py36_0
pycparser                 2.20                       py_0
pygments                  2.6.1                      py_0
pyopenssl                 19.1.0                   py36_0
pyparsing                 2.4.6                      py_0
pyqt                      5.9.2            py36h05f1152_2
pyrsistent                0.16.0           py36h7b6447c_0
pysocks                   1.7.1                    py36_0
python                    3.6.10               hcf32534_1
python-dateutil           2.8.1                      py_0
python-graphviz           0.14                     pypi_0    pypi
pytorch                   1.4.0           cuda101py36h02f0884_0
pytz                      2019.3                     py_0
pywavelets                1.1.1            py36h7b6447c_0
pyyaml                    5.3.1            py36h7b6447c_0
pyzmq                     18.1.1           py36he6710b0_0
qt                        5.9.7                h5867ecd_1
qtconsole                 4.7.3                      py_0
qtpy                      1.9.0                      py_0
readline                  8.0                  h7b6447c_0
regex                     2020.4.4                 pypi_0    pypi
requests                  2.22.0                   py36_1
s3transfer                0.3.3                    pypi_0    pypi
sacremoses                0.0.41                   pypi_0    pypi
scikit-image              0.14.2           py36he6710b0_0
scikit-learn              0.22.1           py36hd81dba3_0
scikit-optimize           0.5.2                    pypi_0    pypi
scipy                     1.4.1            py36h0b6359f_0
send2trash                1.5.0                    py36_0
sentencepiece             0.1.86                   pypi_0    pypi
setuptools                46.1.3                   py36_0
sip                       4.19.8           py36hf484d3e_0
six                       1.14.0                   py36_0
sqlite                    3.31.1               h62c20be_1
tabulate                  0.8.7                    pypi_0    pypi
tensorboard               1.14.0           py36hf484d3e_0
tensorflow                1.14.0          gpu_py36h3fb9ad6_0
tensorflow-base           1.14.0          gpu_py36he45bfe2_0
tensorflow-estimator      1.14.0                     py_0
tensorflow-gpu            1.14.0               h0d30ee6_0
termcolor                 1.1.0                    py36_1
terminado                 0.8.3                    py36_0
testpath                  0.4.4                      py_0
tk                        8.6.8                hbc83047_0
tokenizers                0.7.0                    pypi_0    pypi
toolz                     0.10.0                     py_0
torchvision               0.5.0                py36_cu101    pytorch
tornado                   6.0.4            py36h7b6447c_1
tqdm                      4.45.0                   pypi_0    pypi
traitlets                 4.3.3                    py36_0
transformers              2.9.1                    pypi_0    pypi
urllib3                   1.25.8                   py36_0
wcwidth                   0.1.9                      py_0
webencodings              0.5.1                    py36_1
werkzeug                  1.0.1                      py_0
wheel                     0.34.2                   py36_0
widgetsnbextension        3.5.1                    py36_0
wrapt                     1.12.1           py36h7b6447c_1
xz                        5.2.5                h7b6447c_0
yaml                      0.1.7                had09818_2
zeromq                    4.3.1                he6710b0_3
zipp                      2.2.0                      py_0
zlib                      1.2.11               h7b6447c_3
zstd                      1.3.7                h0b5b093_0
  • Platform: Linux matrix 4.4.0-174-generic #204-Ubuntu SMP Wed Jan 29 06:41:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Python version: Python 3.6.10 :: Anaconda, Inc.
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

2 reactions
patrickvonplaten commented, May 23, 2020

I think the problem is the following. Here: https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L463 the input is encoded and has a length of 701, which is larger than self.tokenizer.model_max_length, so the forward pass of the model crashes.

A simple fix would be to add a statement like:

```python
if inputs['input_ids'].shape[-1] > self.tokenizer.model_max_length:
    logger.warning("Input is cut....")
    inputs['input_ids'] = inputs['input_ids'][:, :self.tokenizer.model_max_length]
```

but I am not sure whether this is the best solution.

I think the best solution would actually be to return a clean error message here and suggest that the user pass the option `max_length=512` to the tokenizer. The problem currently, though, is that when calling:

```python
pipe(very_long_text)
```

no arguments for the batch_encode_plus function can be inserted, for two reasons:

  1. Currently, the TextClassificationPipeline cannot accept a mixture of kwargs and args, see https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L141
  2. The batch_encode_plus function actually does not accept any **kwargs arguments currently, see https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L464

IMO, it would be a good idea to do a larger refactoring here where we allow the pipelines to be more flexible so that batch_encode_plus **kwargs can easily be inserted. @LysandreJik
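
The two options discussed in this comment, cutting the input or raising a clear error, can be sketched without any library code. `check_or_truncate` and its parameters are hypothetical names used only for illustration, and plain lists of token ids stand in for the tensor in `inputs['input_ids']`.

```python
from typing import List


def check_or_truncate(
    input_ids: List[List[int]], model_max_length: int, truncate: bool = False
) -> List[List[int]]:
    """Guard a batch of token-id sequences against an over-long input.

    With truncate=False, raise a descriptive error; with truncate=True,
    cut every sequence to model_max_length tokens.
    """
    longest = max((len(seq) for seq in input_ids), default=0)
    if longest <= model_max_length:
        return input_ids
    if not truncate:
        raise ValueError(
            f"Input has {longest} tokens but the model accepts at most "
            f"{model_max_length}; shorten the text or limit the tokenizer "
            f"output, e.g. with max_length={model_max_length}."
        )
    return [seq[:model_max_length] for seq in input_ids]


batch = [list(range(701))]
print(len(check_or_truncate(batch, 512, truncate=True)[0]))
```

Note that plain slicing like this also drops any closing special token at the end of the sequence, which is one reason a limit applied in the tokenizer may be preferable to cutting the encoded ids afterwards.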

1 reaction
jordanparker6 commented, Nov 3, 2021

It is not working for me either. Code to reproduce error is below.

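from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# ner_model is the model name or path; its value was not given in the original comment
text = ["The Wallabies are going to win the RWC in 2023."]
ner = pipeline(
    task="ner",
    model=AutoModelForTokenClassification.from_pretrained(ner_model),
    tokenizer=AutoTokenizer.from_pretrained(ner_model),
    aggregation_strategy="average",
)
ner(text, truncation=True)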

Error message is:

_sanitize_parameters() got an unexpected keyword argument 'truncation'
