
Where do classes get added as special tokens?

See original GitHub issue

Issue Description


I’ve implemented Donut as a fork of HuggingFace Transformers, and soon I’ll add it to the library. The model is implemented as an instance of VisionEncoderDecoderModel, which makes it possible to combine any vision Transformer encoder (like ViT or Swin) with any text Transformer as decoder (like BERT, GPT-2, etc.). As Donut does exactly that, it was straightforward to implement it this way.

Here’s a notebook that shows inference with it.

I do have 2 questions though:

  • I prepared a toy dataset based on RVL-CDIP, in order to illustrate how to fine-tune the model on document image classification. However, I wonder where the different classes get added to the special tokens of the tokenizer and decoder. The toy dataset can be loaded as follows:
from datasets import load_dataset

dataset = load_dataset("nielsr/rvl_cdip_10_examples_per_class_donut")

When using this dataset to create an instance of DonutDataset, it seems only “<s_class>”, “</s_class>” and “<s_rvlcdip>” are added as special tokens. But looking at this file, it seems that one also defines special tokens for each class. Looking at the code, it seems only the keys of the dictionaries are added as special tokens, not the values.

  • I’ve uploaded all weights to the hub; currently they are all hosted under my own name (nielsr). I wonder whether we can transfer them to the naver-clova-ix organization. Of course, those names are already taken by the PyPI package of this repository, so either we can use branches within the GitHub repos to specify a specific revision, or we can give priority to either HuggingFace Transformers or this PyPI package for the names.
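The key-versus-value behaviour described in the first bullet can be illustrated with a minimal pure-Python sketch. This is a simplified stand-in, not Donut's actual `json2token` implementation: for a ground-truth dict like `{"class": "<advertisement/>"}`, only the *key* produces `<s_…>`/`</s_…>` special tokens, while the value is emitted as plain text and never registered.

```python
def json2token(obj, new_special_tokens):
    """Flatten a JSON-like object into a token sequence.

    Dict keys yield <s_key>/</s_key> wrappers, which are collected as
    special tokens; leaf values are returned as-is and NOT collected.
    """
    if isinstance(obj, dict):
        output = ""
        for key, value in obj.items():
            start, end = f"<s_{key}>", f"</s_{key}>"
            new_special_tokens.update({start, end})  # only keys registered
            output += start + json2token(value, new_special_tokens) + end
        return output
    return str(obj)  # values like "<advertisement/>" pass through untouched


tokens = set()
seq = json2token({"class": "<advertisement/>"}, tokens)
# seq    == "<s_class><advertisement/></s_class>"
# tokens == {"<s_class>", "</s_class>"} — the class token itself is missing
```

So if the per-class tokens like `<advertisement/>` should be atomic tokens, they would have to be registered separately (e.g. via the tokenizer's `add_special_tokens` method), since the recursion above never adds leaf values.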

Let me know what you think!

Kind regards,

Niels
ML Engineer @ HuggingFace

Issue Analytics

  • State: closed
  • Created: 7 months ago
  • Reactions: 5
  • Comments: 6

Top GitHub Comments

NielsRogge commented, Aug 5, 2022


Thanks for updating that 😃

Regarding uploading the checkpoints, I can open up PRs on your repos. I’ll open a PR on the Transformers repository today to add the model to the library. Will update you.

NielsRogge commented, Aug 12, 2022

Hi @gwkrsrch,

I’ve opened PRs on all 8 repos. Feel free to review and merge them 😃
