Regarding Rare Words/OOV Tokens?
Need a few clarifications regarding how rare words and the heuristics are handled in the configuration:
- How does heuristic 2 handle cases where the language is not English, i.e. lowercasing?
- What happens if POS_UNK is set to False? Does it follow this approach? A brief summary of it can be found here.
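For context, the UNK-replacement heuristics referenced above can be sketched roughly as follows. This is an illustrative toy, not the nmt-keras source: the function name, the mapping-dict argument, and the exact behavior of each heuristic are assumptions. The sketch also shows why heuristic 2 is casing-sensitive, which is what makes non-English or uncased scripts tricky.

```python
def replace_unk(src_word, mapping, heuristic):
    """Replace a target-side <unk> with a word derived from its aligned
    source word (hypothetical sketch of the three common heuristics)."""
    if heuristic == 0:
        # Heuristic 0: always copy the aligned source word verbatim.
        return src_word
    if heuristic == 1:
        # Heuristic 1: translate via a source->target mapping dict,
        # falling back to copying when the word is not in the dict.
        return mapping.get(src_word, src_word)
    if heuristic == 2:
        # Heuristic 2 (assumed): map only lowercase-initial words; copy
        # capitalized words verbatim on the assumption that they are
        # proper nouns. This relies on case distinctions existing in
        # the source language.
        if src_word and src_word[0].islower():
            return mapping.get(src_word, src_word)
        return src_word
    raise ValueError(f"unknown heuristic {heuristic}")

print(replace_unk("maison", {"maison": "house"}, 1))   # -> house
print(replace_unk("Paris", {"paris": "paris_fr"}, 2))  # -> Paris (copied)
```

Under this reading, a language without a meaningful upper/lower case distinction would make heuristic 2 degenerate into always consulting the mapping or always copying, depending on how its characters report case.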
Issue Analytics
- Created: 3 years ago
- Comments: 9 (9 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
If the files set in the config (https://github.com/lvapeab/nmt-keras/blob/3f97677bcd017ea9a68a79ecd0df10b907cfebcb/config.py#L16-L18) have already been processed by BPE, you don't want to set TOKENIZATION_METHOD=tokenize_bpe, because it would apply the segmentation twice. In that case you should set TOKENIZATION_METHOD=tokenize_none.
Yes, feel free to open a PR describing how you did this. I can review it.
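The advice above boils down to one config switch. A minimal sketch of the relevant config.py fragment, assuming the corpora were segmented offline with BPE (the comment values are illustrative, not the full nmt-keras option set):

```python
# config.py fragment (sketch): corpora already BPE-segmented offline,
# so the loader must not segment them again.
TOKENIZATION_METHOD = 'tokenize_none'

# If instead the data files were raw, untokenized text, the loader
# would apply BPE itself:
# TOKENIZATION_METHOD = 'tokenize_bpe'
```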
Okay! Will try that right now.
Finally, is it possible to run subword-based NMT with this?
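As the earlier comment suggests, subword NMT amounts to pre-segmenting the corpora with BPE and then telling the toolkit not to segment again. For intuition, BPE application can be sketched as a toy in a few lines; this is not the subword-nmt implementation, and the merge list below is a made-up example, but it shows the core idea of greedily applying learned merges in priority order:

```python
def apply_bpe(word, merges):
    """Segment one word by applying an ordered list of BPE merges
    (toy illustration; real implementations are priority-queue based)."""
    symbols = list(word) + ["</w>"]      # end-of-word marker, as in BPE
    for a, b in merges:                  # merges ordered by learned priority
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # fuse the adjacent pair
            else:
                i += 1
    return symbols

# Hypothetical merge table learned from some corpus:
merges = [("l", "o"), ("lo", "w"), ("r", "</w>"), ("e", "r</w>")]
print(apply_bpe("lower", merges))   # -> ['low', 'er</w>']
print(apply_bpe("low", merges))     # -> ['low', '</w>']
```

An out-of-vocabulary word then decomposes into known subword units instead of mapping to a single UNK token, which is exactly what makes subword NMT attractive for rare words.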