Inserting special tokens
Say I want to insert special tokens into a piece of text to help the model distinguish certain features.
With BERT, I can use tokens like [unused0], [unused1], and so on. Are there similar tokens I can use with DeCLUTR?
For context, I’m using special tokens to delineate the boundaries of named entity mentions. For example, inserting special tokens into
Jim bought 300 shares of Acme Corp. in 2006.
yields
Jim bought 300 shares of [unused0] Acme Corp. [unused1] in 2006.
How can I do this with DeCLUTR? Thanks!
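
A minimal sketch of one possible route, assuming DeCLUTR's released checkpoints load through the standard Hugging Face transformers tokenizer API (the checkpoint name "johngiorgi/declutr-small" and the [ENT_START]/[ENT_END] markers below are illustrative assumptions, not part of the original question):

```python
from transformers import AutoTokenizer

# Assumed checkpoint name; substitute the checkpoint you actually use.
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")

# Register custom marker tokens so the tokenizer treats each one as a single,
# never-split token with its own id.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[ENT_START]", "[ENT_END]"]}
)

text = "Jim bought 300 shares of [ENT_START] Acme Corp. [ENT_END] in 2006."
print(tokenizer.tokenize(text))  # the markers survive as single tokens
```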
I can't think of why it would, but I would double-check to be sure! Might be worth it to dig into the resize_token_embeddings function.

Oh, so resizing the model input embeddings would not force a re-train from scratch? If so, I will likely try that. Thank you.
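
For anyone reading later, a minimal sketch of the resizing step being discussed, assuming the underlying encoder is a standard Hugging Face model (the checkpoint name and marker tokens are illustrative assumptions). Resizing copies the pretrained embedding rows and only randomly initializes the rows for the newly added tokens, which is why it does not force a retrain from scratch; some fine-tuning is still needed before the new embeddings carry useful meaning:

```python
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name; substitute the one you actually use.
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
model = AutoModel.from_pretrained("johngiorgi/declutr-small")

old_rows = model.get_input_embeddings().weight.shape[0]

# Add two hypothetical marker tokens, then grow the embedding matrix to match
# the new vocabulary size. Existing rows are copied over unchanged; only the
# rows for the new tokens are freshly initialized.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[ENT_START]", "[ENT_END]"]}
)
model.resize_token_embeddings(len(tokenizer))

new_rows = model.get_input_embeddings().weight.shape[0]
print(old_rows, "->", new_rows)  # the vocabulary grows by exactly two rows
```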