How might I use the tokenizers from the HuggingFace Transformers library?
❓ Questions and Help
Description
TL;DR: Has anyone been able to successfully integrate the transformers library tokenizer with torchtext?
I wanted to use the torchtext library to process/load data for use with the transformers library. I was able to set their tokenizer in a Field object and build a vocabulary without issue:
from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer
path = 'path/to/med_nli/'
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
TEXT = data.Field(use_vocab=True, tokenize=tokenizer.tokenize)
LABEL = data.LabelField()
fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}
train, valid, test = data.TabularDataset.splits(
    path=path,
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json',
    fields=fields
)
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)
TEXT.build_vocab(train)
LABEL.build_vocab(train)
Note: I am using the MedNLI dataset, but it appears to follow the same format as SNLI.
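The catch, as far as I can tell, is that build_vocab assigns its own indices, which won't line up with the token ids the pretrained bert-base-uncased embeddings expect. A quick sanity check, sketched below using the objects defined above (the token 'hospital' is just an arbitrary example), makes the mismatch visible:
# torchtext's vocab indices generally differ from the tokenizer's own ids,
# so numericalizing through TEXT.vocab would feed the wrong ids to BERT.
token = 'hospital'  # arbitrary example token
print(TEXT.vocab.stoi[token])                  # index assigned by torchtext's vocab
print(tokenizer.convert_tokens_to_ids(token))  # id expected by bert-base-uncased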
But I am stuck on how to numericalize according to the tokenizer's vocab, so I tried to numericalize in the Field with the tokenizer's encode method and set use_vocab=False:
from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer
path = 'path/to/med_nli/'
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode)
LABEL = data.LabelField()
fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}
train, valid, test = data.TabularDataset.splits(
    path=path,
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json',
    fields=fields
)
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)
# TEXT.build_vocab(train)
LABEL.build_vocab(train)
But then I get a strange error when trying to access a batch:
batch = next(iter(train_iter))
print("Numericalize premises:\n", batch.premise)
print("Numericalize hypotheses:\n", batch.hypothesis)
print("Entailment labels:\n", batch.label)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-9919119fad82> in <module>
----> 1 batch = next(iter(train_iter))
~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/iterator.py in __iter__(self)
~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/batch.py in __init__(self, data, dataset, device)
~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in process(self, batch, device)
~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in numericalize(self, arr, device)
ValueError: too many dimensions 'str'
Any suggestions on how to go about this?
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@JohnGiorgi your code is exactly how this should be done: you don't use torchtext's vocab and instead provide your own tokenization. The error happens during the padding step while batching data, because the default padding token in data.Field is '<pad>', which is a string (and since you're not using a vocab, there's no way to convert it to an index). The solution is simply to fetch the pad index from the tokenizer and pass that (int) value as the pad_token argument. This works for my case (I copied your code and used it with a different dataset).
In this case, no. build_vocab is relevant only when you use a vocab.