
How might I use the tokenizers from the HuggingFace Transformers library?


❓ Questions and Help

Description

TL;DR: Has anyone been able to successfully integrate the transformers library tokenizer with torchtext?

I wanted to use the torchtext library to process/load data for use with the transformers library. I was able to set their tokenizer in a Field object and build a vocabulary without issue:

from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer

path = 'path/to/med_nli/'

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Use the HuggingFace tokenizer only for splitting into wordpieces;
# torchtext builds its own vocabulary over those tokens below.
TEXT = data.Field(use_vocab=True, tokenize=tokenizer.tokenize)
LABEL = data.LabelField()

fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}

train, valid, test = data.TabularDataset.splits(
    path=path, 
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json', 
    fields=fields
)

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)

TEXT.build_vocab(train)
LABEL.build_vocab(train)

Note: I am using the MedNLI dataset, but it appears to be formatted in the same way as the SNLI dataset.
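(As an aside, a quick hypothetical sanity check makes the underlying problem visible; the word 'hospital' below is just an arbitrary token assumed to be in the training data. The indices torchtext assigns when building its own vocabulary generally do not match the ids bert-base-uncased was pretrained with, so they cannot be fed to the model directly.)

# Hypothetical sanity check: compare torchtext's assigned index with the
# id the pretrained tokenizer uses for the same token. They will usually differ.
print(TEXT.vocab.stoi['hospital'])                  # index assigned by TEXT.build_vocab(train)
print(tokenizer.convert_tokens_to_ids('hospital'))  # id expected by bert-base-uncased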

But I am stuck on how to numericalize according to the tokenizer's vocab. So I tried to numericalize in the field with the tokenizer's encode method and set use_vocab=False:

from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer

path = 'path/to/med_nli/'

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Let the tokenizer handle numericalization directly: encode() returns token ids,
# so no torchtext vocab is used for TEXT.
TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode)
LABEL = data.LabelField()

fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}

train, valid, test = data.TabularDataset.splits(
    path=path, 
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json', 
    fields=fields
)

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)

# TEXT.build_vocab(train)  # not needed: the tokenizer already provides ids
LABEL.build_vocab(train)

But then I get an error when trying to access a batch:

batch = next(iter(train_iter))
print("Numericalize premises:\n", batch.premise)
print("Numericalize hypotheses:\n", batch.hypothesis)
print("Entailment labels:\n", batch.label)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-9919119fad82> in <module>
----> 1 batch = next(iter(train_iter))

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/iterator.py in __iter__(self)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/batch.py in __init__(self, data, dataset, device)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in process(self, batch, device)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in numericalize(self, arr, device)

ValueError: too many dimensions 'str'

Any suggestions on how to go about this?


Top GitHub Comments

mttk commented, Oct 3, 2019 (14 reactions)

@JohnGiorgi your code is exactly how this should be done. You don’t use torchtext’s vocab and instead provide your own tokenization. The error happens in the padding step during batching: the default padding token in data.Field is '<pad>', which is a string (and since you’re not using a vocab, there’s no way to convert it to an index).

The solution is simply to fetch the pad index from the tokenizer and pass that integer as the pad_token argument:

# Look up the integer id of the tokenizer's padding token ('[PAD]' for BERT)
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode, pad_token=pad_index)

This works for my case (I copied your code and used it with a different dataset).
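For completeness, a minimal sketch of verifying the fix (assuming the Field above is recreated with pad_token=pad_index and the datasets and iterators are rebuilt as in the original snippet):

batch = next(iter(train_iter))
print(batch.premise.dtype)  # torch.int64 -- the premises are now tensors of token ids

# Decode the first premise back to text as a sanity check.
# Field defaults to batch_first=False, so dimension 0 is sequence length.
ids = batch.premise[:, 0].tolist()
print(tokenizer.decode(ids, skip_special_tokens=True))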

mttk commented, Oct 6, 2019 (2 reactions)

In this case, no. build_vocab is relevant only when you use a vocab.
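To make that concrete, a small sketch along the lines of the code above (the label names shown are only illustrative of an SNLI-style gold_label field):

# TEXT gets its ids straight from the tokenizer, so it needs no torchtext vocab.
# LABEL is still a plain LabelField, so it does need one:
LABEL.build_vocab(train)
print(dict(LABEL.vocab.stoi))  # e.g. {'entailment': 0, 'contradiction': 1, 'neutral': 2} (illustrative)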

On Sun, Oct 6, 2019, 16:00 John Giorgi notifications@github.com wrote:

@mttk Thank you so much. This is exactly what I was looking for!

Final question, do I still need to call TEXT.build_vocab() in my case?



