Cannot init tokenizers from third-party model (ALBERT model)
🐛 Bug
Information
Model I am using: ALBERT
Language I am using the model on: Chinese
The problem arises when using:
- [x] the official example scripts: (give details below)
Following the instructions on https://huggingface.co/models, for example with the "voidful/albert_chinese_tiny" model:
AutoTokenizer.from_pretrained('voidful/albert_chinese_tiny')
raises:
Model name 'voidful/albert_chinese_tiny' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed 'voidful/albert_chinese_tiny' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
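For reference, a minimal reproduction sketch (assuming a transformers version from around when this issue was filed; the exact exception text may differ):

```python
from transformers import AutoTokenizer

# AutoTokenizer maps this repo to AlbertTokenizer, which expects a
# sentencepiece vocabulary (spiece.model) that the repo does not ship,
# hence the "couldn't find such vocabulary files" error quoted above.
tokenizer = AutoTokenizer.from_pretrained("voidful/albert_chinese_tiny")
```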
Since sentencepiece is not used in the albert_chinese models, you have to call BertTokenizer instead of AlbertTokenizer. We can verify this with a MaskedLM example (see the sketch below the result).
colab trial
Result:
心 0.9422469735145569
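A minimal sketch of that MaskedLM check (assumed example sentence and current transformers API, not the exact colab code):

```python
import torch
from transformers import BertTokenizer, AlbertForMaskedLM

pretrained = "voidful/albert_chinese_tiny"
tokenizer = BertTokenizer.from_pretrained(pretrained)  # note: BertTokenizer, not AlbertTokenizer
model = AlbertForMaskedLM.from_pretrained(pretrained)

# Assumed example sentence; any sentence containing [MASK] works.
text = "今天[MASK]情很好"
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the vocabulary at the masked position.
probs = logits[0, mask_pos[0]].softmax(dim=-1)
top_prob, top_id = probs.topk(1)
print(tokenizer.convert_ids_to_tokens(int(top_id)), float(top_prob))
# Prints the predicted character and its probability,
# e.g. 心 0.94... as in the result above.
```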
You need to add from_pt=True in order to load a PyTorch checkpoint (e.g. when using the TensorFlow model classes).
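A minimal sketch of what that looks like, assuming the TensorFlow ALBERT class is the target (from_pt=True converts the repo's PyTorch weights on load):

```python
from transformers import BertTokenizer, TFAlbertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("voidful/albert_chinese_tiny")
# The repo only ships a PyTorch checkpoint, so the TF class needs from_pt=True.
model = TFAlbertForMaskedLM.from_pretrained("voidful/albert_chinese_tiny", from_pt=True)
```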