Regarding Rare Words/OOV Tokens

I need a few clarifications on how rare words and the heuristics in the configuration are handled:

  • How does heuristic 2 handle languages other than English (i.e. with respect to lowercasing)?
  • What happens if POS_UNK is set to False?
  • Does it follow this approach? A brief summary of it can be found here.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
lvapeab commented, Apr 21, 2020

If the files set in the config (https://github.com/lvapeab/nmt-keras/blob/3f97677bcd017ea9a68a79ecd0df10b907cfebcb/config.py#L16-L18) have already been processed with BPE, you don’t want to set TOKENIZATION_METHOD=tokenize_bpe, because that would apply the segmentation twice. In that case, set TOKENIZATION_METHOD=tokenize_none.
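To make the choice concrete, here is a minimal sketch of that decision, assuming you do want subword segmentation. The helper functions are hypothetical and not part of nmt-keras; only TOKENIZATION_METHOD, tokenize_bpe and tokenize_none come from the comment above. It also assumes the corpus was segmented with the "@@" continuation marker, which is the default separator in subword-nmt; adjust it if your BPE tool uses a different one.

```python
# Hypothetical helpers (not part of nmt-keras): guess whether a corpus has
# already been BPE-segmented, then pick the matching TOKENIZATION_METHOD.

BPE_MARKER = "@@"  # assumption: subword-nmt's default continuation marker


def looks_bpe_segmented(lines, marker=BPE_MARKER):
    """Return True if any whitespace-separated token ends with the marker."""
    return any(tok.endswith(marker) for line in lines for tok in line.split())


def choose_tokenization_method(lines):
    """Avoid applying BPE twice: skip it when the text is already segmented."""
    return "tokenize_none" if looks_bpe_segmented(lines) else "tokenize_bpe"


raw = ["the unbelievable story"]
segmented = ["the un@@ believ@@ able story"]

# The returned string is what you would assign to TOKENIZATION_METHOD in config.py.
method_for_raw = choose_tokenization_method(raw)
method_for_segmented = choose_tokenization_method(segmented)
```

The check is only a heuristic: a raw corpus that happens to contain tokens ending in "@@" would be misclassified, so it is worth inspecting a few lines of the training files by hand before relying on it.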

Maybe an update about this script in the README?

Yes, feel free to open a PR describing how you did this. I can review it.

1 reaction
VP007-py commented, Apr 19, 2020

Okay! Will try that right now.

Finally, is it possible to run subword-based NMT with this?
