
Difficulty using the package due to outdated documentation and lack of examples


📚 Documentation

Description

I'm trying to use torchtext for review rating prediction, but the new API is not well documented yet. I tried to learn from the migration Jupyter notebook, but it fails on cell #10 if I change the torchtext version to 0.10.0.

There are a few things that remain unclear to me:

  • I would imagine Vectors require a specific tokenizer, as they might encode special symbols differently. Is this correct? How does GloVe treat special characters?
  • Are vectors fixed, or do they receive gradients as well?
  • Is there a current roadmap for reviewing the documentation?

Thanks, Pedro

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
parmeet commented, Jul 1, 2021

@parmeet there was this build_vocab_from_iterator that was in torchtext.vocab in 0.9.0, but I couldn't find it in 0.10.0. I think it's deprecated now, or moved to legacy?

Actually, we have it for the new vocab as well; refer here for documentation. Also refer to the sections "Creating Vocab from text file" and "Backward Incompatible changes" in the release notes for additional details and usage.

from collections import Counter

from tqdm import tqdm
from torchtext.legacy.vocab import Vocab  # the counter-based Vocab lives under legacy in 0.10.0


def build_vocab_from_iterator(iterator, num_lines=None, *args, **kwargs):
    """
    Build a Vocab from an iterator.

    Args:
        iterator: Iterator used to build Vocab. Must yield list or iterator of tokens.
        num_lines: The expected number of elements returned by the iterator.
            (Default: None)
            Optionally, if known, the expected number of elements can be passed to
            this factory function for improved progress reporting.
        *args, **kwargs: Forwarded to the Vocab constructor (e.g. specials=[...]).
    """
    counter = Counter()
    with tqdm(unit_scale=0, unit='lines', total=num_lines) as t:
        for tokens in iterator:
            counter.update(tokens)
            t.update(1)
    word_vocab = Vocab(counter, *args, **kwargs)
    return word_vocab

I also modified it to support *args and **kwargs, so that now I can do this: build_vocab_from_iterator(apply_transforms(data), len(data), specials=['<unk>', '<pad>', '<bos>', '<eos>'])
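The counting-plus-specials pattern above can be sketched without torchtext at all. The following is a hypothetical minimal stand-in (the function and variable names are made up for illustration, not part of the library): specials come first in the index, then remaining tokens by descending frequency.

```python
from collections import Counter


def build_simple_vocab(token_lists, specials=('<unk>', '<pad>')):
    """Toy stand-in for build_vocab_from_iterator: specials keep their
    given order at the front; remaining tokens follow by frequency."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    itos = list(specials) + [tok for tok, _ in counter.most_common()
                             if tok not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


stoi, itos = build_simple_vocab([['the', 'cat'], ['the', 'dog']])
```

Here `itos` is `['<unk>', '<pad>', 'the', 'cat', 'dog']`: the two specials first, then 'the' (frequency 2) ahead of the singletons.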

1 reaction
parmeet commented, Jul 1, 2021

📚 Documentation

Description

I'm trying to use torchtext for review rating prediction, but the new API is not well documented yet. I tried to learn from the migration Jupyter notebook, but it fails on cell #10 if I change the torchtext version to 0.10.0.

Thanks @pedropgusmao for bringing this up. Yes, you are right: this cell is not functional for version 0.10.0 due to the update in Vocab. I will update it.

There are a few things that remain unclear to me:

  • I would imagine Vectors require a specific tokenizer, as they might encode special symbols differently. Is this correct? How does GloVe treat special characters?

This is a good question. I would suggest referring to the original source of the vectors to learn more about how to tokenize; for example, refer here for GloVe and FastText. We do not explicitly encode any special symbols; we provide a wrapper for what's available as part of the original source vectors. For unknown token queries, we simply return a zero tensor by default (or one initialized with a specific value provided by the user) with the same dimension as the original source vectors.
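The unknown-token fallback described above can be sketched in plain Python. This is a toy illustration with made-up names and data (torchtext returns a tensor; this version returns a list):

```python
def lookup_vector(vectors, token, dim, default_value=0.0):
    """Toy sketch of the fallback: known tokens return their stored
    vector; unknown tokens return a constant vector of the same dim."""
    return vectors.get(token, [default_value] * dim)


# Made-up 3-dim "pretrained" vectors standing in for GloVe/FastText.
glove_like = {'cat': [0.1, 0.2, 0.3]}
```

A query for a known token returns its vector; an out-of-vocabulary query like `lookup_vector(glove_like, 'xyzzy', 3)` returns `[0.0, 0.0, 0.0]`.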

  • Are vectors fixed, or do they receive gradients as well?

Vectors are simply containers that map tokens to their corresponding vector representations. If you want your vectors to be trainable, I would suggest using nn.Embedding.

update: Related issue #1350
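One way to get trainable vectors, as suggested above, is nn.Embedding.from_pretrained, which freezes the weights by default and trains them when freeze=False. A minimal sketch (the weight values here are made up, standing in for real pretrained vectors):

```python
import torch
import torch.nn as nn

# A made-up 4-token, 3-dim pretrained matrix standing in for real vectors.
pretrained = torch.tensor([[0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6],
                           [0.7, 0.8, 0.9],
                           [0.0, 0.0, 0.0]])

frozen = nn.Embedding.from_pretrained(pretrained)                   # fixed, no gradients
trainable = nn.Embedding.from_pretrained(pretrained, freeze=False)  # receives gradients
```

With freeze=False the embedding weight participates in backprop like any other parameter, so the vectors are fine-tuned during training.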

  • Is there a current roadmap for reviewing the documentation?

I really appreciate you bringing this up. Please do make suggestions, or feel free to raise issues wherever you find the documentation lacking. We will try our best to address it. Please note that with this new release (0.10.0), we have deprecated the legacy vocab and replaced it with the new Vocab module. You can find additional details in the release notes and refer to the documentation here. I would also suggest learning more through the tutorials here, which are already updated with regard to the latest features like iterable datasets and the new Vocab module.

