
deberta-v3 has 100 more vocabs than its tokenizer

See original GitHub issue

System Info

  • transformers version: 4.22.1
  • Platform: macOS-12.6-arm64-arm-64bit
  • Python version: 3.8.13
  • Huggingface_hub version: 0.10.0
  • PyTorch version (GPU?): 1.12.1 (False)
  • Tensorflow version (GPU?): 2.8.1 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

Hi @LysandreJik @SaulLu, I think this issue needs both of you to help with or confirm:

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

model_type = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_type)
print(tokenizer.vocab_size) # output: 128000
print(len(tokenizer.vocab)) # output: 128001, the extra one is padding?

config = AutoConfig.from_pretrained(model_type)
print(config.vocab_size) # output: 128100
model = AutoModel.from_pretrained(model_type, config=config)
print(print(len(model.embeddings.word_embeddings.weight)) # 128100, which is consistent with the config

Expected behavior

The DeBERTa model should have the same vocab_size as its tokenizer.
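For readers hitting the same mismatch, here is a minimal sanity check, assuming the extra 100 embedding rows are reserved slots that the tokenizer never emits; the sample sentence and printed message are only illustrative:

from transformers import AutoConfig, AutoTokenizer

model_type = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_type)
config = AutoConfig.from_pretrained(model_type)

# Every id the tokenizer can emit must index a valid row of the embedding matrix,
# so the embedding table may be larger than the tokenizer vocab, but never smaller.
sample = "DeBERTa-v3 reserves more embedding rows than its tokenizer defines."
ids = tokenizer(sample)["input_ids"]
assert max(ids) < config.vocab_size
assert len(tokenizer) <= config.vocab_size
print(f"tokenizer defines {len(tokenizer)} tokens; embedding table has {config.vocab_size} rows")

If that assumption holds, the mismatch is harmless at inference time: the extra rows simply are never indexed.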

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
wenmin-wu commented, Oct 12, 2022

@SaulLu Got another improvement again. I’m a Prize Contender now! Many thanks!

1 reaction
wenmin-wu commented, Oct 8, 2022

@SaulLu Got it, thanks a lot for your detailed explanation

