How to share vocabulary across fields?
I'm new to torchtext. I want to use two fields that should share the same vocabulary. The only difference is that `field2` prepends an `<sos>` token to every sequence. I have the following code:
```python
import torch
from torchtext.data import ReversibleField, TabularDataset

# tokenizer is defined elsewhere, e.g. tokenizer = str.split
field1 = ReversibleField(tensor_type=torch.LongTensor, tokenize=tokenizer)
field2 = ReversibleField(tensor_type=torch.LongTensor, tokenize=tokenizer,
                         init_token='<sos>')
dataset = TabularDataset(path='train.json', format='json',
                         fields={'x': ('x', field1), 'y': ('y', field2)})
field1.build_vocab(dataset, max_size=30000)
```
Now I want `field2` to use the vocab of `field1`. I tried `field2.vocab = field1.vocab`, but this results in an error in later processing. According to the documentation, the only way to force the same vocabulary seems to be to use the same field object for both columns, but setting the `init_token` dynamically isn't possible when the dataset is read by a `BucketIterator`.
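For reference, this is the kind of sharing I was hoping for. A minimal sketch, assuming the legacy `torchtext.data` API and assuming the error comes from `<sos>` never being added to the vocab built through `field1` (whose specials don't include an `init_token`):

```python
# Build one vocab over both columns, via field2 so that its specials
# (notably '<sos>') are registered alongside '<unk>' and '<pad>'.
field2.build_vocab(dataset.x, dataset.y, max_size=30000)

# Share the very same Vocab object instead of copying it: both fields
# then index tokens identically, and later updates (e.g. loaded word
# vectors) are visible through either field.
field1.vocab = field2.vocab
```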
My current workaround is to first save `field1` and then load it back as `field2`:
```python
field1 = ReversibleField(tensor_type=torch.LongTensor, tokenize=tokenizer,
                         init_token='<sos>')
# ... build vocab as before
torch.save(field1, 'field1.pt')
field2 = torch.load('field1.pt')  # independent copy, vocab included
field1.init_token = None          # only field2 should prepend <sos>
```
However, I don't know whether this will also work if the fields are associated with word vectors. Since the vocabs of `field1` and `field2` are essentially two independent copies, any update to the word embeddings would have to be performed on both fields. Alternatively, I could manually create a vocabulary dict and use it to initialize two copies of `Vocab`, but how can I make the two fields use them?
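For illustration, here is a sketch of that manual route (again assuming the legacy API, with `torchtext.vocab.Vocab` and a `Counter` built from the tokenized columns); the key point is to assign one shared `Vocab` object to both fields rather than two copies:

```python
from collections import Counter
from torchtext.vocab import Vocab

# Count tokens from both columns of the dataset.
counter = Counter()
for example in dataset:
    counter.update(example.x)
    counter.update(example.y)

# One Vocab whose specials cover the needs of both fields.
shared_vocab = Vocab(counter, max_size=30000,
                     specials=['<unk>', '<pad>', '<sos>'])

# Both fields point at the same object, so embedding updates
# (e.g. shared_vocab.load_vectors(...)) affect both at once.
field1.vocab = shared_vocab
field2.vocab = shared_vocab
```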
Is there a recommended way to share vocabulary?
Top GitHub Comments
I have exactly the same question: is there any official way to share vocabulary across fields? Thanks!
I think this may be one way to do it!