
How to enable tokenizer padding option in feature extraction pipeline?

See original GitHub issue

I am trying to use the pipeline() to extract features for sentence tokens. Because my sentences are not all the same length, and I am going to feed the token features to RNN-based models, I want to pad the sentences to a fixed length so that every sentence yields features of the same size. Before learning about the convenient pipeline() method, I used a more general approach to get the features, which works fine but is inconvenient, like this:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.'

# Pad and truncate to a fixed length of 40 tokens
encoded_input = tokenizer(text, padding='max_length', truncation=True, max_length=40)
indexed_tokens = encoded_input['input_ids']
segments_ids = encoded_input['token_type_ids']

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

with torch.no_grad():
    # Pass token_type_ids by keyword; the model's second positional
    # argument is attention_mask, not token_type_ids.
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    hidden_states = outputs.hidden_states  # tuple of per-layer [1, 40, 768] tensors
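
For example, a minimal sketch of one way to do that merging (averaging the last four hidden layers is an illustrative choice on my part, not the only option):

# hidden_states is a tuple of 13 tensors (embedding layer + 12 encoder layers),
# each of shape [1, 40, 768] for this padded input.
token_features = torch.stack(hidden_states[-4:]).mean(dim=0).squeeze(0)
print(token_features.shape)  # torch.Size([40, 768])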

Then I also need to merge (or select) the features from the returned hidden_states myself, roughly as sketched above, and finally get a [40, 768] padded feature matrix for this sentence's tokens, as I want. However, as you can see, this is very inconvenient. By comparison, the pipeline method works very well and easily, needing only the following five lines of code.

from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
nlp = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
features = nlp(text)

Then I can directly get the token features of the original (unpadded) sentence, which is of shape [22, 768].
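
To sanity-check that shape, I can wrap the nested list in a NumPy array (a small sketch; depending on the transformers version the result may carry an extra leading batch dimension):

import numpy as np

print(np.array(features).shape)  # e.g. (1, 22, 768) or (22, 768), depending on the version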

However, how can I enable the tokenizer's padding option in the pipeline? From #9432 and #9576 I saw that we can now pass truncation options to the pipeline object (here called nlp), so I imitated that and wrote this code:

text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
features = nlp(text, padding='max_length', truncation=True, max_length=40)

The program did not throw an error, but it just returned a [512, 768] matrix. Is there a way to correctly enable the padding options? Thank you!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Jan 19, 2021

Your result is of length 512 because you asked for padding="max_length", and the tokenizer's maximum length is 512. If you ask for "longest", it will pad up to the longest value in your batch:

>>> text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
>>> features = nlp([text, text * 2], padding="longest", truncation=True, max_length=40)

returns features which are of size [42, 768].
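
As a side note beyond the original comment: newer transformers releases expose a tokenize_kwargs argument on the feature-extraction pipeline, which forwards padding and truncation options to the tokenizer and yields the fixed-size [40, 768] output the question asked for. A minimal sketch, assuming a release that supports this argument (it was not available when the issue was filed):

from transformers import pipeline

# Assumes a transformers release where FeatureExtractionPipeline
# accepts tokenize_kwargs.
nlp = pipeline(
    'feature-extraction',
    model='bert-base-uncased',
    tokenize_kwargs={'padding': 'max_length', 'truncation': True, 'max_length': 40},
)

features = nlp('After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.')
# Each input now yields a fixed-size [40, 768] token-feature matrix.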

0 reactions
github-actions[bot] commented, Apr 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
