deberta-v3 model vocab_size is 100 larger than its tokenizer's vocabulary
System Info
- transformers version: 4.22.1
- Platform: macOS-12.6-arm64-arm-64bit
- Python version: 3.8.13
- Huggingface_hub version: 0.10.0
- PyTorch version (GPU?): 1.12.1 (False)
- Tensorflow version (GPU?): 2.8.1 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
Hi @LysandreJik @SaulLu, I think this issue needs both of you to help resolve or confirm:
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
from transformers import AutoConfig, AutoModel, AutoTokenizer

model_type = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_type)
print(tokenizer.vocab_size)  # output: 128000
print(len(tokenizer.vocab))  # output: 128001, the extra one is padding?
config = AutoConfig.from_pretrained(model_type)
print(config.vocab_size)  # output: 128100
model = AutoModel.from_pretrained(model_type, config=config)
print(len(model.embeddings.word_embeddings.weight))  # output: 128100, consistent with the config
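For reference, the gap can be probed directly: the tokenizer never emits an id above its own vocabulary, so the extra embedding rows are unreachable from tokenizer output. A minimal check, continuing from the snippet above (this probing is mine, not part of the original report, and the exact ids assume the microsoft/deberta-v3-base checkpoint):

# Highest id the tokenizer can ever emit, special tokens included.
max_tok_id = max(tokenizer.get_vocab().values())
print(max_tok_id)  # output: 128000

# Rows above max_tok_id exist in the embedding matrix but can never be
# indexed by tokenizer output; they appear to be spare slots.
num_rows = model.embeddings.word_embeddings.weight.shape[0]
print(num_rows - (max_tok_id + 1))  # output: 99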
Expected behavior
The DeBERTa model should have the same vocab_size as its tokenizer.
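If the mismatch matters downstream (for example, code that asserts the tokenizer and model vocabularies agree), one workaround is to shrink the embedding matrix with resize_token_embeddings, a standard transformers method that also updates the config. This is a sketch of a possible workaround, not the resolution the maintainers gave in this issue:

# Truncate the embedding matrix to the tokenizer's length; existing rows
# for surviving ids are kept as-is.
model.resize_token_embeddings(len(tokenizer))
print(model.embeddings.word_embeddings.weight.shape[0])  # output: 128001
print(model.config.vocab_size)  # output: 128001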
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@SaulLu Got another improvement again. I’m a Prize Contender now! Many thanks!
@SaulLu Got it, thanks a lot for your detailed explanation.