Regarding Rare Words/OOV Tokens?
Need a few clarifications regarding how rare words and the heuristics are handled in the configuration:
- How does heuristic 2 handle cases where the language is not English, i.e. lowercasing?
- What happens if POS_UNK is set to False? Does it follow this approach? A brief summary of it can be found here.
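For context, the UNK-replacement heuristics referenced above can be sketched roughly as follows. This is an illustrative toy, not the nmt-keras source: the function name, the mapping-dict argument, and the exact behavior of each heuristic are assumptions. The sketch also shows why heuristic 2 is casing-sensitive, which is what makes non-English or uncased scripts tricky.

```python
def replace_unk(src_word, mapping, heuristic):
    """Replace a target-side <unk> with a word derived from its aligned
    source word (hypothetical sketch of the three common heuristics)."""
    if heuristic == 0:
        # Heuristic 0: always copy the aligned source word verbatim.
        return src_word
    if heuristic == 1:
        # Heuristic 1: translate via a source->target mapping dict,
        # falling back to copying when the word is not in the dict.
        return mapping.get(src_word, src_word)
    if heuristic == 2:
        # Heuristic 2 (assumed): map only lowercase-initial words; copy
        # capitalized words verbatim on the assumption that they are
        # proper nouns. This relies on case distinctions existing in
        # the source language.
        if src_word and src_word[0].islower():
            return mapping.get(src_word, src_word)
        return src_word
    raise ValueError(f"unknown heuristic {heuristic}")

print(replace_unk("maison", {"maison": "house"}, 1))   # -> house
print(replace_unk("Paris", {"paris": "paris_fr"}, 2))  # -> Paris (copied)
```

Under this reading, a language without a meaningful upper/lower case distinction would make heuristic 2 degenerate into always consulting the mapping or always copying, depending on how its characters report case.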
Issue Analytics
- Created: 3 years ago
- Comments: 9 (9 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
If the files set in the config (https://github.com/lvapeab/nmt-keras/blob/3f97677bcd017ea9a68a79ecd0df10b907cfebcb/config.py#L16-L18) have already been processed by BPE, you don't want to set TOKENIZATION_METHOD=tokenize_bpe, because it would apply the segmentation twice. In that case you should set TOKENIZATION_METHOD=tokenize_none.
Yes, feel free to open a PR describing how you did this. I can review it.
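The advice above boils down to one config switch. A minimal sketch of the relevant config.py fragment, assuming the corpora were segmented offline with BPE (the comment values are illustrative, not the full nmt-keras option set):

```python
# config.py fragment (sketch): corpora already BPE-segmented offline,
# so the loader must not segment them again.
TOKENIZATION_METHOD = 'tokenize_none'

# If instead the data files were raw, untokenized text, the loader
# would apply BPE itself:
# TOKENIZATION_METHOD = 'tokenize_bpe'
```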
Okay! Will try that right now.
Finally, is it possible to run subword-based NMT with this?
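As the earlier comment suggests, subword NMT amounts to pre-segmenting the corpora with BPE and then telling the toolkit not to segment again. For intuition, BPE application can be sketched as a toy in a few lines; this is not the subword-nmt implementation, and the merge list below is a made-up example, but it shows the core idea of greedily applying learned merges in priority order:

```python
def apply_bpe(word, merges):
    """Segment one word by applying an ordered list of BPE merges
    (toy illustration; real implementations are priority-queue based)."""
    symbols = list(word) + ["</w>"]      # end-of-word marker, as in BPE
    for a, b in merges:                  # merges ordered by learned priority
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # fuse the adjacent pair
            else:
                i += 1
    return symbols

# Hypothetical merge table learned from some corpus:
merges = [("l", "o"), ("lo", "w"), ("r", "</w>"), ("e", "r</w>")]
print(apply_bpe("lower", merges))   # -> ['low', 'er</w>']
print(apply_bpe("low", merges))     # -> ['low', '</w>']
```

An out-of-vocabulary word then decomposes into known subword units instead of mapping to a single UNK token, which is exactly what makes subword NMT attractive for rare words.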