
Feature Request: Word+Character-level tokenization

See original GitHub issue

Hi, thanks for your awesome work on this; the library looks super useful. I was wondering whether it is possible to tokenize a sequence into both words (a list of strings) and characters (a list of lists of 1-length strings). From a look through the source code, it doesn’t seem to be supported yet, but I may have missed something.

I’d be happy to contribute something to extend torchtext to support this, but I’m not sure what the proper way to handle this would be (ideally it’d be extensible to other tokenization schemes as well, but perhaps that’s a stretch). Thoughts?
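For reference, the requested behavior can be sketched in plain Python. This is a minimal illustration of the desired output shape, not torchtext API; the function name is hypothetical and the word tokenizer is naive whitespace splitting:

```python
def word_char_tokenize(text):
    """Split text into words, plus each word into 1-char strings.

    Returns (words, chars): words is a list of strings, chars is a
    list of lists of single-character strings, aligned with words.
    """
    words = text.split()  # naive whitespace word tokenizer
    chars = [list(word) for word in words]
    return words, chars

words, chars = word_char_tokenize("hello world")
# words -> ["hello", "world"]
# chars -> [["h", "e", "l", "l", "o"], ["w", "o", "r", "l", "d"]]
```

Any real implementation would plug a proper word tokenizer (e.g. spaCy) into the first step; the character level is just `list(word)` on each token.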

Thanks!

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
oya163 commented, May 7, 2019

I am also stuck on the same problem. I am trying to implement a BiLSTM+CNN with the help of torchtext, but I seem to get lost. If there is a clear direction, it would be a great help.

0 reactions
binhna commented, Oct 18, 2019

> I am also stuck on the same problem. I am trying to implement a BiLSTM+CNN with the help of torchtext, but I seem to get lost. If there is a clear direction, it would be a great help.

Have you solved this yet? I am also trying to combine word-level and char-level tokenization using torchtext.
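For the BiLSTM+CNN use case both commenters describe, the character sequences must be padded to a common width before they can be batched (this is what torchtext's later `NestedField` handles when a char-level `Field` is nested inside a word-level one). A minimal sketch of that padding step in plain Python, with a hypothetical helper name and pad token:

```python
def pad_chars(words, pad_token="<pad>", max_word_len=None):
    """Pad each word's character list to a common length for batching.

    If max_word_len is not given, pad to the longest word in the batch.
    """
    char_lists = [list(w) for w in words]
    width = max_word_len or max(len(c) for c in char_lists)
    return [c + [pad_token] * (width - len(c)) for c in char_lists]

padded = pad_chars(["hi", "there"])
# every inner list is now length 5 (the length of "there"), so the
# batch can be turned into a rectangular tensor for the char-CNN
```

The word-level tokens would go through the usual word embedding path, while this padded character grid feeds the CNN; the two representations are typically concatenated per word before the BiLSTM.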

Read more comments on GitHub >

Top Results From Across the Web

  • Word, Subword and Character-based tokenization: Know the ...
    Tokenization in simple words is the process of splitting a phrase, sentence, paragraph, or one or multiple text documents into smaller units.
  • Summary of the tokenizers - Hugging Face
    As we saw in the preprocessing tutorial, tokenizing a text is splitting it into words or subwords, which are then converted to ids...
  • Tokenization in NLP: Types, Challenges, Examples, Tools
    Tokenization can separate sentences, words, characters, or subwords. When we split the text into sentences, we call it sentence tokenization. For words, we...
  • What is Tokenization | Methods to Perform Tokenization
    Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or ...
  • Tokenizer reference | Elasticsearch Guide [8.5] | Elastic
    The ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation),...
