Tokenization in C++
Is there any general strategy for tokenizing text in C++ in a way that's compatible with the existing pretrained BertTokenizer implementation?
I'm looking to use a fine-tuned BERT model in C++ for inference, and currently the only option seems to be reproducing the BertTokenizer code manually (or modifying it to be compatible with TorchScript). Has anyone come up with a better solution than this?
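For anyone attempting the manual route, the core of BertTokenizer is WordPiece: greedy longest-match-first lookup against the vocab.txt vocabulary, with continuation pieces prefixed by "##" and words that cannot be segmented mapped to "[UNK]". Below is a minimal, byte-level sketch of just that step; it assumes the vocabulary is already loaded into a set and the input word has already been basic-tokenized (lowercased, split on whitespace and punctuation), and it does no Unicode normalization:

```cpp
// Minimal WordPiece sketch: greedy longest-match-first against a vocab set.
// Byte-level only; a faithful port would operate on Unicode code points.
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> wordpiece_tokenize(
    const std::string& word,
    const std::unordered_set<std::string>& vocab,
    size_t max_chars = 100) {
  if (word.size() > max_chars) return {"[UNK]"};
  std::vector<std::string> pieces;
  size_t start = 0;
  while (start < word.size()) {
    size_t end = word.size();
    std::string cur;
    bool found = false;
    // Greedily take the longest substring starting at `start` that is in the vocab.
    while (start < end) {
      std::string sub = word.substr(start, end - start);
      if (start > 0) sub = "##" + sub;  // continuation pieces carry the "##" prefix
      if (vocab.count(sub)) { cur = sub; found = true; break; }
      --end;
    }
    if (!found) return {"[UNK]"};       // the whole word falls back to [UNK]
    pieces.push_back(cur);
    start = end;
  }
  return pieces;
}
```

A faithful reimplementation would also need the basic tokenizer that runs before this step (Unicode cleanup, CJK character handling, accent stripping during lowercasing), which is where most of the porting effort tends to go.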
You should wait a few days if you can: @n1t0 is working on something that will very likely solve your problem, and it should be ready for a first release before the end of the year.
Why was this closed? https://github.com/huggingface/tokenizers offers no C++ solution other than developing a Rust -> C++ interop wrapper yourself, which wouldn’t work in my case.
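For readers who can take the FFI route, here is roughly what that wrapper would involve on the C++ side: the Rust tokenizers crate would be compiled as a cdylib exporting a small C ABI, and C++ would call it through extern "C" declarations. The sketch below is hypothetical, it will not link without a Rust shim exporting these symbols, and the function names (tokenizer_from_file, tokenizer_encode, tokenizer_free) are illustrative rather than part of any published library:

```cpp
// Hypothetical C++ side of a Rust -> C++ interop wrapper around the
// huggingface/tokenizers crate. The Rust side would own a tokenizers::Tokenizer
// and export these C-ABI functions from a cdylib; none of them ship today.
#include <cstdint>
#include <string>
#include <vector>

extern "C" {
  // Opaque handle to the Rust tokenizer living behind the FFI boundary.
  struct TokenizerHandle;
  TokenizerHandle* tokenizer_from_file(const char* path);
  // Writes up to `cap` token ids into `out`; returns the number written.
  size_t tokenizer_encode(TokenizerHandle* t, const char* text,
                          uint32_t* out, size_t cap);
  void tokenizer_free(TokenizerHandle* t);
}

// Convenience wrapper: encode a string into token ids.
std::vector<uint32_t> encode(TokenizerHandle* t, const std::string& text) {
  std::vector<uint32_t> ids(512);
  ids.resize(tokenizer_encode(t, text.c_str(), ids.data(), ids.size()));
  return ids;
}
```

Whether maintaining a shim like this is acceptable depends on the deployment constraints, which is exactly the concern raised above.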