Tokenization in C++

Is there a general strategy for tokenizing text in C++ that is compatible with the existing pretrained BertTokenizer implementation? I want to use a fine-tuned BERT model in C++ for inference, and currently the only options seem to be reproducing the BertTokenizer code manually or modifying it to be compatible with TorchScript. Has anyone come up with a better solution than this?
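For context on what "reproducing it manually" involves: the core of BERT's WordPiece step is a greedy longest-match-first lookup against the vocabulary, with continuation pieces prefixed by `##`. Below is a minimal sketch of that step only. The function name `wordpiece` and its parameters are hypothetical, not part of any library. It assumes the input word has already been normalized and split (lowercasing, accent stripping, and punctuation splitting are not shown) and that the text is ASCII, since it indexes bytes rather than Unicode code points. It is not a drop-in replacement for BertTokenizer.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not a library API): greedy longest-match-first
// WordPiece over one already-normalized, whitespace-delimited word.
// Assumes ASCII input; real BERT tokenization works on Unicode characters.
std::vector<std::string> wordpiece(const std::string& word,
                                   const std::unordered_set<std::string>& vocab,
                                   const std::string& unk = "[UNK]",
                                   std::size_t max_chars = 100) {
    // Overly long words are mapped to the unknown token as a whole.
    if (word.size() > max_chars) return {unk};

    std::vector<std::string> pieces;
    std::size_t start = 0;
    while (start < word.size()) {
        std::size_t end = word.size();
        std::string match;
        // Shrink the candidate from the right until it is in the vocabulary.
        while (end > start) {
            std::string candidate = word.substr(start, end - start);
            if (start > 0) candidate = "##" + candidate;  // continuation piece
            if (vocab.count(candidate) > 0) {
                match = candidate;
                break;
            }
            --end;
        }
        // If no prefix matches, the whole word becomes the unknown token.
        if (match.empty()) return {unk};
        pieces.push_back(match);
        start = end;
    }
    return pieces;
}
```

With a toy vocabulary of `un`, `##aff`, and `##able`, calling `wordpiece("unaffable", vocab)` yields the pieces `un`, `##aff`, `##able`. In practice the vocabulary would be loaded from the model's `vocab.txt`, and each piece mapped to its line index to get the input IDs.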

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

thomwolf commented, Dec 11, 2019 (8 reactions)

You should wait a few days if you can because @n1t0 is working on something that will very likely solve your problem and it should be ready for a first release before the end of the year.

MarkJGx commented, Mar 29, 2021 (6 reactions)

Why was this closed? https://github.com/huggingface/tokenizers offers no C++ solution other than developing a Rust -> C++ interop wrapper yourself, which wouldn’t work in my case.

Top Results From Across the Web

Tokenizing strings in C
strtok has an internal state variable tracking the string being tokenized. When you pass NULL to it, strtok will continue to use this...

Tokenization (The C Preprocessor)
Preprocessing tokens fall into five broad classes: identifiers, preprocessing numbers, string literals, punctuators, and other. An identifier is the same as an ...

String tokenisation function in C
In this section, we will see how to tokenize strings in C. The C has library function for this. The C library function...

STR06-C. Do not assume that strtok() leaves the parse ...
The C function strtok() is a string tokenization function that takes two arguments: an initial string to be parsed and a const -qualified...

Tokenizing a string in C++
Just like strtok() function in C, strtok_r() does the same task of parsing a string into a sequence of tokens. strtok_r() is a...
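
The strtok behavior these results describe (the first call takes the buffer, later calls pass a null pointer to continue from the saved position) can be sketched in C++ as follows. The wrapper name `split_with_strtok` is hypothetical. Note that this is plain delimiter splitting, which is unrelated to WordPiece, and that `std::strtok` overwrites delimiters in the buffer and keeps hidden global state, so the sketch works on a private copy and should not be interleaved with other `strtok` calls.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical wrapper around std::strtok. `text` is taken by value because
// strtok writes '\0' over delimiters in the buffer it scans.
std::vector<std::string> split_with_strtok(std::string text, const char* delims) {
    std::vector<std::string> tokens;
    // First call passes the buffer; later calls pass nullptr so strtok
    // resumes from its internally saved position.
    for (char* tok = std::strtok(&text[0], delims);
         tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

Runs of adjacent delimiters are skipped, so splitting `"a,b;;c"` on `",;"` gives `a`, `b`, `c` with no empty tokens.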
