
How might I use the tokenizers from the HuggingFace Transformers library?


❓ Questions and Help

Description

TL;DR: Has anyone been able to successfully integrate the transformers library tokenizer with torchtext?

I wanted to use the torchtext library to process/load data for use with the transformers library. I was able to set their tokenizer in a Field object and build a vocabulary without issue:

from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer

path = 'path/to/med_nli/'

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Use the HuggingFace tokenizer only for splitting into wordpieces;
# torchtext builds its own vocabulary over those tokens below.
TEXT = data.Field(use_vocab=True, tokenize=tokenizer.tokenize)
LABEL = data.LabelField()

fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}

train, valid, test = data.TabularDataset.splits(
    path=path, 
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json', 
    fields=fields
)

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)

TEXT.build_vocab(train)
LABEL.build_vocab(train)

Note: I am using the MedNLI dataset, but it appears to be formatted in the same way as the SNLI dataset.
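(As an aside, a quick hypothetical sanity check makes the underlying problem visible; the word 'hospital' below is just an arbitrary token assumed to be in the training data. The indices torchtext assigns when building its own vocabulary generally do not match the ids bert-base-uncased was pretrained with, so they cannot be fed to the model directly.)

# Hypothetical sanity check: compare torchtext's assigned index with the
# id the pretrained tokenizer uses for the same token. They will usually differ.
print(TEXT.vocab.stoi['hospital'])                  # index assigned by TEXT.build_vocab(train)
print(tokenizer.convert_tokens_to_ids('hospital'))  # id expected by bert-base-uncased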

But I am stuck on how to numericalize according to the tokenizer's vocab. So I tried to numericalize in the field with the tokenizer's encode method and set use_vocab=False:

from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer

path = 'path/to/med_nli/'

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Let the tokenizer handle numericalization directly: encode() returns token ids,
# so no torchtext vocab is used for TEXT.
TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode)
LABEL = data.LabelField()

fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}

train, valid, test = data.TabularDataset.splits(
    path=path, 
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json', 
    fields=fields
)

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)

# TEXT.build_vocab(train)  # not needed: the tokenizer already provides ids
LABEL.build_vocab(train)

But then I get an error when trying to access a batch:

batch = next(iter(train_iter))
print("Numericalize premises:\n", batch.premise)
print("Numericalize hypotheses:\n", batch.hypothesis)
print("Entailment labels:\n", batch.label)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-9919119fad82> in <module>
----> 1 batch = next(iter(train_iter))

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/iterator.py in __iter__(self)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/batch.py in __init__(self, data, dataset, device)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in process(self, batch, device)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in numericalize(self, arr, device)

ValueError: too many dimensions 'str'

Any suggestions on how to go about this?


Top GitHub Comments

mttk commented, Oct 3, 2019 (14 reactions)

@JohnGiorgi your code is exactly how this should be done. You don’t use torchtext’s vocab and instead provide your own tokenization. The error happens in the padding step during batching: the default padding token in data.Field is '<pad>', which is a string (and since you’re not using a vocab, there’s no way to convert it to an index).

The solution is simply to fetch the pad index from the tokenizer and pass that integer as the pad_token argument:

# Look up the integer id of the tokenizer's padding token ('[PAD]' for BERT)
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode, pad_token=pad_index)

This works for my case (I copied your code and used it with a different dataset).
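For completeness, a minimal sketch of verifying the fix (assuming the Field above is recreated with pad_token=pad_index and the datasets and iterators are rebuilt as in the original snippet):

batch = next(iter(train_iter))
print(batch.premise.dtype)  # torch.int64 -- the premises are now tensors of token ids

# Decode the first premise back to text as a sanity check.
# Field defaults to batch_first=False, so dimension 0 is sequence length.
ids = batch.premise[:, 0].tolist()
print(tokenizer.decode(ids, skip_special_tokens=True))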

mttk commented, Oct 6, 2019 (2 reactions)

In this case, no. build_vocab is relevant only when you use a vocab.
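To make that concrete, a small sketch along the lines of the code above (the label names shown are only illustrative of an SNLI-style gold_label field):

# TEXT gets its ids straight from the tokenizer, so it needs no torchtext vocab.
# LABEL is still a plain LabelField, so it does need one:
LABEL.build_vocab(train)
print(dict(LABEL.vocab.stoi))  # e.g. {'entailment': 0, 'contradiction': 1, 'neutral': 2} (illustrative)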

On Sun, Oct 6, 2019, 16:00 John Giorgi notifications@github.com wrote:

@mttk Thank you so much. This is exactly what I was looking for!

Final question, do I still need to call TEXT.build_vocab() in my case?



