Feature Request: Word+Character-level tokenization
Hi,
Thanks for your awesome work on this; this library looks super useful. I was wondering whether it is possible to tokenize a sequence into both words (a list of strings) and characters (a list of lists of length-1 strings). From a look through the source code, it doesn’t seem to be supported yet, but I may have missed something.
I’d be happy to contribute something to extend torchtext to support this, but I’m not sure what the proper way to handle it would be (ideally it would be extensible to other tokenization schemes as well, but perhaps that’s a stretch). Thoughts?
Thanks!
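To make the request concrete, here is a minimal sketch in plain Python of the output being asked for. It is not an existing torchtext API: the helper name `tokenize_words_and_chars` and its `word_tokenizer` parameter are hypothetical, chosen only for illustration. The word tokenizer is passed in as a callable so that other tokenization schemes could be swapped in.

```python
from typing import Callable, List, Tuple


def tokenize_words_and_chars(
    text: str,
    word_tokenizer: Callable[[str], List[str]] = str.split,
) -> Tuple[List[str], List[List[str]]]:
    """Hypothetical helper (not part of torchtext).

    Returns the word-level tokens and, for each word, its characters
    as a list of length-1 strings.
    """
    # Word level: delegate to whatever tokenizer the caller supplies.
    words = word_tokenizer(text)
    # Character level: split every word into its individual characters.
    chars = [list(word) for word in words]
    return words, chars


words, chars = tokenize_words_and_chars("hello big world")
print(words)
print(chars)
```

The two returned lists are aligned, so `chars[i]` always holds the characters of `words[i]`. Padding and numericalization for batching are left out of this sketch.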
Issue Analytics
- State:
- Created 6 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
I am also stuck on the same problem. I am trying to implement a BiLSTM+CNN model with the help of torchtext, but I keep getting lost. If there is a clear direction, it would be a great help.
Have you solved this yet? I am also trying to implement word-level combined with character-level tokenization using torchtext.