Feature Request: Word+Character-level tokenization

Hi, thanks for your awesome work on this; this library looks super useful. I was wondering whether it is possible to tokenize a sequence into both words (a list of strings) and characters (a list of lists of length-1 strings). From a look through the source code, it doesn’t seem to be supported yet, but I may have missed something.

I’d be happy to contribute something to extend torchtext to support this, but I’m not sure what the proper way to handle this would be (ideally it’d be extensible to other tokenization schemes as well, but perhaps that’s a stretch). Thoughts?

Thanks!
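
To make the request concrete, here is a minimal sketch in plain Python of the two-level output described above. It does not use any torchtext API; the function names (`tokenize_words`, `tokenize_chars`, `tokenize_words_and_chars`) are hypothetical, and whitespace splitting is only an assumed stand-in for whatever word tokenizer would actually be used.

```python
from typing import List, Tuple


def tokenize_words(text: str) -> List[str]:
    """Split on whitespace (an assumed, simplistic word tokenizer)."""
    return text.split()


def tokenize_chars(words: List[str]) -> List[List[str]]:
    """Turn each word into a list of length-1 strings."""
    return [list(word) for word in words]


def tokenize_words_and_chars(text: str) -> Tuple[List[str], List[List[str]]]:
    """Return both views of the same sequence, aligned word by word."""
    words = tokenize_words(text)
    return words, tokenize_chars(words)


words, chars = tokenize_words_and_chars("hello big world")
```

Keeping the character lists aligned with the word list (one inner list per word) is the design choice that matters here, since a model that combines both levels needs to know which characters belong to which word.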

Issue Analytics

Created 6 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

oya163 commented, May 7, 2019

I am stuck on the same problem. I am trying to implement a BiLSTM+CNN with the help of torchtext, but I keep getting lost. Any clear direction would be a great help.

binhna commented, Oct 18, 2019

Have you solved this yet? I am also trying to combine word-level and character-level tokenization using torchtext.
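
For anyone combining the two levels by hand, one possible direction is to encode each example as a flat list of word ids plus a nested, padded list of character ids. The sketch below is plain Python and makes no claim about torchtext's API; `build_vocab`, `encode_example`, the `<pad>`/`<unk>` tokens, and the `max_word_len` parameter are all hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens


def build_vocab(tokens: Iterable[str]) -> Dict[str, int]:
    """Map each distinct token to an id, reserving 0 for PAD and 1 for UNK."""
    vocab = {PAD: 0, UNK: 1}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode_example(
    words: List[str],
    word_vocab: Dict[str, int],
    char_vocab: Dict[str, int],
    max_word_len: int,
) -> Tuple[List[int], List[List[int]]]:
    """Return word ids and a (num_words x max_word_len) grid of char ids."""
    word_ids = [word_vocab.get(w, word_vocab[UNK]) for w in words]
    char_ids = []
    for w in words:
        # Truncate long words, then right-pad short ones to a fixed width.
        ids = [char_vocab.get(c, char_vocab[UNK]) for c in w[:max_word_len]]
        ids += [char_vocab[PAD]] * (max_word_len - len(ids))
        char_ids.append(ids)
    return word_ids, char_ids


words = ["a", "cab"]
word_vocab = build_vocab(words)
char_vocab = build_vocab(c for w in words for c in w)
word_ids, char_ids = encode_example(words, word_vocab, char_vocab, max_word_len=4)
```

The fixed-width character grid is what a character-level CNN would typically consume per word, while the word ids feed the word-level embedding; batching across sentences would additionally require padding the word dimension.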
