Vocab() is broken: getting errors when providing keyword arguments to function: `__init__() got an unexpected keyword argument 'min_freq'`
🐛 Bug
Describe the bug
I was working through the migration notebook to understand the new API. The Vocab() class seems broken, at least in Google Colab: I get an error when I try to create a Vocab() with the min_freq=10 setting. Even after I removed this setting, I get TypeError: __init__() got an unexpected keyword argument 'specials'. This suggests that Vocab is not recognizing any of the keyword arguments mentioned in the API docs. I was using torchtext 0.11.0 with PyTorch 1.10.0+cu111 in a Google Colab notebook.
To Reproduce
- start a google colab notebook.
- follow the migration tutorial in the torchtext repo.
- Run the following code; this generates the error (IMDB and the tokenizer come from the tutorial's earlier cells):

```python
from collections import Counter
from torchtext.vocab import Vocab
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')
train_iter = IMDB(split='train')
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=10, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))
```
The following error and stack trace is generated:

```
TypeError                                 Traceback (most recent call last)
<ipython-input-8-e5262609a934> in <module>()
      6 for (label, line) in train_iter:
      7     counter.update(tokenizer(line))
----> 8 vocab = Vocab(counter, min_freq=1, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

TypeError: __init__() got an unexpected keyword argument 'min_freq'
```
But even after removing min_freq, I still get an error:

```python
train_iter = IMDB(split='train')
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))
```
I get the error message:

```
TypeError                                 Traceback (most recent call last)
<ipython-input-13-39009faace9c> in <module>()
      6 for (label, line) in train_iter:
      7     counter.update(tokenizer(line))
----> 8 vocab = Vocab(counter, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

TypeError: __init__() got an unexpected keyword argument 'specials'
```
Expected behavior
This code should build a vocabulary containing only words that occur at least 10 times.
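For reference, the intended min_freq semantics can be sketched with the standard library alone (no torchtext; a toy token list stands in for the tokenized IMDB lines):

```python
from collections import Counter

# Toy corpus standing in for the tokenized IMDB lines.
tokens = ["the", "the", "the", "movie", "movie", "was", "good"]

counter = Counter(tokens)

# Keep only tokens that occur at least min_freq times,
# which is what Vocab(counter, min_freq=...) is expected to do.
min_freq = 2
kept = {tok: n for tok, n in counter.items() if n >= min_freq}
print(sorted(kept))  # ['movie', 'the']
```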
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:

```shell
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
python -c "import torchtext; print(\"torchtext version is \", torchtext.__version__)"
```
- PyTorch Version (e.g., 1.0): 1.10
- OS (e.g., Linux): Google Colab notebook with gpu.
- How you installed PyTorch (conda, pip, source): PyTorch was already installed
- Build command you used (if compiling from source): NA
- Python version: 3.7.2
- CUDA/cuDNN version: unknown
- GPU models and configuration: unknown; whatever Colab provided at the time.
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
I would strongly recommend following version >=0.10.0.
Yes, it's our goal to standardize around PyTorch DataLoaders and Datasets, which is why we also deprecated our legacy data-abstraction APIs and re-created the torchtext datasets as iterable-style datasets. Please do follow development on the main branch, where we are also adding support for Model APIs in upcoming releases. Thank you for your feedback 😃
@parmeet Ahhh okay, that makes sense. Thanks for pointing out the docs. I was confused by the difference I encountered between 0.9.0 and 0.11.0, but now it makes sense as part of the design. So the tutorial is only valid for 0.9.0, and for 0.10.0 onward I should look at the other docs. That makes sense.
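For anyone landing here: in torchtext >= 0.10.0 the Counter-based constructor was replaced by the torchtext.vocab.vocab() factory, which takes an OrderedDict of token frequencies. A minimal sketch of the mapping from the old call, with the torchtext part left in comments since it requires torchtext installed (only the ordered input is built here, using the standard library):

```python
from collections import Counter, OrderedDict

counter = Counter(["the", "the", "movie", "movie", "was", "good"])

# The new factory expects tokens ordered by descending frequency.
ordered_dict = OrderedDict(counter.most_common())

# With torchtext >= 0.10 installed, the old
#   Vocab(counter, min_freq=..., specials=...)
# call becomes roughly:
#   from torchtext.vocab import vocab
#   v = vocab(ordered_dict, min_freq=2)
# (handling of special tokens varies by version; check the 0.10+ docs)
print(list(ordered_dict))  # ['the', 'movie', 'was', 'good']
```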
So just to clarify, what is the best strategy for now? Should I stabilize on 0.9.0, or is it better to follow 0.10.0? I was not sure which API is the most stable for now.
I actually really like how the new version is using the standard pytorch DataLoaders 😃.