
Difficulty using the package due to outdated documentation and lack of examples


📚 Documentation

Description

I'm trying to use torchtext for review rating prediction, but the new API is not well documented yet. I tried to learn from the migration Jupyter notebook, but it fails on cell #10 if I change the torchtext version to 0.10.0.

There are a few things that remain unclear to me:

  • I would imagine Vectors require a specific tokenizer, as they might encode special symbols differently. Is this correct? How does GloVe treat special characters?
  • Are vectors fixed, or do they receive gradients as well?
  • Is there a current roadmap for reviewing the documentation?

Thanks, Pedro

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
parmeet commented, Jul 1, 2021

@parmeet there was this build_vocab_from_iterator that was in torchtext.vocab in 0.9.0, but I couldn't find it in 0.10.0. I think it's deprecated now, or moved to legacy?

Actually, we have it for the new vocab as well; refer here for documentation. Also refer to the sections "Creating Vocab from text file" and "Backward Incompatible changes" in the release notes for additional details and usage.

from collections import Counter

from tqdm import tqdm
from torchtext.legacy.vocab import Vocab  # the counter-based Vocab lives under legacy in 0.10.0


def build_vocab_from_iterator(iterator, num_lines=None, *args, **kwargs):
    """
    Build a Vocab from an iterator.

    Args:
        iterator: Iterator used to build Vocab. Must yield list or iterator of tokens.
        num_lines: The expected number of elements returned by the iterator.
            (Default: None)
            Optionally, if known, the expected number of elements can be passed to
            this factory function for improved progress reporting.
        *args, **kwargs: Forwarded to the Vocab constructor (e.g. specials=[...]).
    """
    counter = Counter()
    with tqdm(unit_scale=0, unit='lines', total=num_lines) as t:
        for tokens in iterator:
            counter.update(tokens)
            t.update(1)
    word_vocab = Vocab(counter, *args, **kwargs)
    return word_vocab

I also modified it to support *args and **kwargs, so that now I can do this: build_vocab_from_iterator(apply_transforms(data), len(data), specials=['<unk>', '<pad>', '<bos>', '<eos>'])
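The counting-plus-specials pattern above can be sketched without torchtext at all. The following is a hypothetical minimal stand-in (the function and variable names are made up for illustration, not part of the library): specials come first in the index, then remaining tokens by descending frequency.

```python
from collections import Counter


def build_simple_vocab(token_lists, specials=('<unk>', '<pad>')):
    """Toy stand-in for build_vocab_from_iterator: specials keep their
    given order at the front; remaining tokens follow by frequency."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    itos = list(specials) + [tok for tok, _ in counter.most_common()
                             if tok not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


stoi, itos = build_simple_vocab([['the', 'cat'], ['the', 'dog']])
```

Here `itos` is `['<unk>', '<pad>', 'the', 'cat', 'dog']`: the two specials first, then 'the' (frequency 2) ahead of the singletons.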

1 reaction
parmeet commented, Jul 1, 2021

📚 Documentation

Description

I'm trying to use torchtext for review rating prediction, but the new API is not well documented yet. I tried to learn from the migration Jupyter notebook, but it fails on cell #10 if I change the torchtext version to 0.10.0.

Thanks @pedropgusmao for bringing this up. Yes, you are right: this cell is not functional for version 0.10.0 due to the update in Vocab. I will update it.

There are a few things that remain unclear to me:

  • I would imagine Vectors require a specific tokenizer, as they might encode special symbols differently. Is this correct? How does GloVe treat special characters?

This is a good question. I would suggest referring to the original source of the vectors to learn more about how to tokenize; for example, refer here for GloVe and FastText. We do not explicitly encode any special symbols; we provide a wrapper for what's available as part of the original source vectors. For unknown token queries, we simply return a zero tensor by default (or one initialized with a specific value provided by the user) with the same dimension as the original source vectors.
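The unknown-token fallback described above can be sketched in plain Python. This is a toy illustration with made-up names and data (torchtext returns a tensor; this version returns a list):

```python
def lookup_vector(vectors, token, dim, default_value=0.0):
    """Toy sketch of the fallback: known tokens return their stored
    vector; unknown tokens return a constant vector of the same dim."""
    return vectors.get(token, [default_value] * dim)


# Made-up 3-dim "pretrained" vectors standing in for GloVe/FastText.
glove_like = {'cat': [0.1, 0.2, 0.3]}
```

A query for a known token returns its vector; an out-of-vocabulary query like `lookup_vector(glove_like, 'xyzzy', 3)` returns `[0.0, 0.0, 0.0]`.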

  • Are vectors fixed, or do they receive gradients as well?

Vectors are simply containers that map tokens to their corresponding vector representations. If you want your vectors to be trainable, I would suggest using nn.Embedding.

update: Related issue #1350
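One way to get trainable vectors, as suggested above, is nn.Embedding.from_pretrained, which freezes the weights by default and trains them when freeze=False. A minimal sketch (the weight values here are made up, standing in for real pretrained vectors):

```python
import torch
import torch.nn as nn

# A made-up 4-token, 3-dim pretrained matrix standing in for real vectors.
pretrained = torch.tensor([[0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6],
                           [0.7, 0.8, 0.9],
                           [0.0, 0.0, 0.0]])

frozen = nn.Embedding.from_pretrained(pretrained)                   # fixed, no gradients
trainable = nn.Embedding.from_pretrained(pretrained, freeze=False)  # receives gradients
```

With freeze=False the embedding weight participates in backprop like any other parameter, so the vectors are fine-tuned during training.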

  • Is there a current roadmap for reviewing the documentation?

I really appreciate you bringing this up. Please do make suggestions, or feel free to raise issues wherever you find the documentation lacking. We will try our best to address it. Please note that with this new release (0.10.0), we have deprecated the legacy vocab and replaced it with the new Vocab module. You can find additional details in the release notes and refer to the documentation here. I would also suggest learning more through the tutorials here, which are already updated with regard to the latest features like iterable datasets and the new Vocab module.

