Vocab() is broken: getting errors when providing keyword arguments to function: `__init__() got an unexpected keyword argument 'min_freq'`
🐛 Bug
Describe the bug
I was working through the migration notebook to understand the new API. The Vocab() class seems broken, at least in Google Colab: I get an error when I try to create a Vocab() with the min_freq=10 setting. Even after I removed this setting, I get TypeError: __init__() got an unexpected keyword argument 'specials'. This suggests that Vocab is not recognizing any of the keyword arguments mentioned in the API docs. I was using torchtext 0.11.0 with PyTorch 1.10.0+cu111 in a Google Colab notebook.
To Reproduce
- start a google colab notebook.
- follow the migration tutorial in the torchtext repo.
- Run the following code; this generates the error (IMDB and the tokenizer come from the tutorial's earlier cells):

```python
from collections import Counter
from torchtext.vocab import Vocab
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')
train_iter = IMDB(split='train')
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=10, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))
```
The following error and stack trace is generated:

```
TypeError                                 Traceback (most recent call last)
<ipython-input-8-e5262609a934> in <module>()
      6 for (label, line) in train_iter:
      7     counter.update(tokenizer(line))
----> 8 vocab = Vocab(counter, min_freq=1, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

TypeError: __init__() got an unexpected keyword argument 'min_freq'
```
But even after removing min_freq, I still get an error:

```python
train_iter = IMDB(split='train')
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))
```
I get the error message:

```
TypeError                                 Traceback (most recent call last)
<ipython-input-13-39009faace9c> in <module>()
      6 for (label, line) in train_iter:
      7     counter.update(tokenizer(line))
----> 8 vocab = Vocab(counter, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

TypeError: __init__() got an unexpected keyword argument 'specials'
```
Expected behavior
This code should build a vocabulary containing only words that occur at least 10 times.
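For reference, the intended min_freq semantics can be sketched with the standard library alone (no torchtext; a toy token list stands in for the tokenized IMDB lines):

```python
from collections import Counter

# Toy corpus standing in for the tokenized IMDB lines.
tokens = ["the", "the", "the", "movie", "movie", "was", "good"]

counter = Counter(tokens)

# Keep only tokens that occur at least min_freq times,
# which is what Vocab(counter, min_freq=...) is expected to do.
min_freq = 2
kept = {tok: n for tok, n in counter.items() if n >= min_freq}
print(sorted(kept))  # ['movie', 'the']
```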
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:

```shell
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
python -c "import torchtext; print(\"torchtext version is \", torchtext.__version__)"
```
- PyTorch Version (e.g., 1.0): 1.10
- OS (e.g., Linux): Google Colab notebook with gpu.
- How you installed PyTorch (conda, pip, source): PyTorch was already installed
- Build command you used (if compiling from source): NA
- Python version: 3.7.2
- CUDA/cuDNN version: unknown
- GPU models and configuration: unknown; whatever Colab provided at the time.
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
I would strongly recommend following version >=0.10.0.
Yes, it's our goal to standardize around PyTorch DataLoaders and Datasets, which is why we also deprecated our legacy data-abstraction APIs and re-created the torchtext datasets as iterable-style datasets. Please do follow development on the main branch, where we are also adding support for Model APIs in upcoming releases. Thank you for your feedback 😃
@parmeet Ahhh okay, that makes sense. Thanks for pointing out the docs. I was confused by the difference I encountered between 0.9.0 and 0.11.0, but now it makes sense as part of the design. So the tutorial is only valid for 0.9.0, and for 0.10.0 onward I should look at the other docs. That makes sense.
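For anyone landing here: in torchtext >= 0.10.0 the Counter-based constructor was replaced by the torchtext.vocab.vocab() factory, which takes an OrderedDict of token frequencies. A minimal sketch of the mapping from the old call, with the torchtext part left in comments since it requires torchtext installed (only the ordered input is built here, using the standard library):

```python
from collections import Counter, OrderedDict

counter = Counter(["the", "the", "movie", "movie", "was", "good"])

# The new factory expects tokens ordered by descending frequency.
ordered_dict = OrderedDict(counter.most_common())

# With torchtext >= 0.10 installed, the old
#   Vocab(counter, min_freq=..., specials=...)
# call becomes roughly:
#   from torchtext.vocab import vocab
#   v = vocab(ordered_dict, min_freq=2)
# (handling of special tokens varies by version; check the 0.10+ docs)
print(list(ordered_dict))  # ['the', 'movie', 'was', 'good']
```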
So just to clarify, what is the best strategy for now? Should I stabilize on 0.9.0, or is it better to follow 0.10.0? I was not sure which API is the most stable for now.
I actually really like how the new version is using the standard pytorch DataLoaders 😃.