Cannot init tokenizer from a third-party model (ALBERT)


🐛 Bug

Information

Model I am using: ALBERT

Language I am using the model on (English, Chinese …):

The problem arises when using:

  • [x] the official example scripts: (give details below)

Following the instructions at:

https://huggingface.co/models

for a model such as "voidful/albert_chinese_tiny", calling

AutoTokenizer.from_pretrained('voidful/albert_chinese_tiny')

raises:

Model name 'voidful/albert_chinese_tiny' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed 'voidful/albert_chinese_tiny' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
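
The error message says the loader looked for a vocabulary file named spiece.model. A minimal sketch for checking which vocabulary file a locally downloaded model directory actually contains is shown here. The helper name guess_tokenizer_family is hypothetical, and the sketch assumes that a WordPiece vocabulary is stored as vocab.txt, which is the usual BERT file name but is not stated in the issue.

```python
import os
import tempfile

# 'spiece.model' is taken from the error message above.
# 'vocab.txt' is assumed here as the BERT-style WordPiece vocabulary file name.
SENTENCEPIECE_VOCAB = "spiece.model"
WORDPIECE_VOCAB = "vocab.txt"


def guess_tokenizer_family(model_dir):
    """Hypothetical helper: suggest a tokenizer class name from the files in model_dir."""
    files = set(os.listdir(model_dir))
    if SENTENCEPIECE_VOCAB in files:
        return "AlbertTokenizer"  # SentencePiece vocabulary present
    if WORDPIECE_VOCAB in files:
        return "BertTokenizer"  # WordPiece vocabulary present
    return None  # no known vocabulary file found


# Demo on a throwaway directory that only contains an empty vocab.txt.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, WORDPIECE_VOCAB), "w").close()
    family = guess_tokenizer_family(demo_dir)

print(family)
```

This only inspects file names; it does not load or validate the vocabulary itself.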

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

voidful commented, Apr 4, 2020

Since sentencepiece is not used in the albert_chinese models, you have to call BertTokenizer instead of AlbertTokenizer. We can evaluate it with a MaskedLM example:

colab trial

from transformers import AlbertForMaskedLM, BertTokenizer
import torch
from torch.nn.functional import softmax

pretrained = 'voidful/albert_chinese_large'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)

inputtext = "今天[MASK]情很好"

# Encode once and locate the [MASK] token by its id instead of a hard-coded 103
token_ids = tokenizer.encode(inputtext, add_special_tokens=True)
maskpos = token_ids.index(tokenizer.mask_token_id)

input_ids = torch.tensor(token_ids).unsqueeze(0)  # Batch size 1
with torch.no_grad():
    outputs = model(input_ids)
# No labels are passed, so the first output holds the prediction scores
prediction_scores = outputs[0]
logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token, logit_prob[predicted_index])

Result: 心 0.9422469735145569
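
The last steps of the example (softmax over the scores at the masked position, then argmax to pick the token) can be sketched without loading a model. This is a toy illustration only: the three-token vocabulary and the scores below are made up and do not come from the actual model.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities, shifting by the max for numerical stability."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and made-up scores for the masked position.
toy_vocab = ["心", "天", "事"]
toy_scores = [5.0, 1.0, 0.5]

probs = softmax(toy_scores)
# argmax: index of the highest-probability token
predicted_index = max(range(len(probs)), key=probs.__getitem__)
predicted_token = toy_vocab[predicted_index]
print(predicted_token, probs[predicted_index])
```

In the real example the vocabulary is the tokenizer's and the scores come from the model output at index maskpos.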

voidful commented, Aug 3, 2021

I have tried this code:

from transformers import TFAutoModel, BertTokenizer
pretrained = 'voidful/albert_chinese_xlarge'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = TFAutoModel.from_pretrained(pretrained)

inputs = tokenizer("我喜欢你!", return_tensors="tf")
outputs = model(**inputs)

print(outputs)

It fails with:

OSError: Can't load weights for 'voidful/albert_chinese_xlarge'. Make sure that:

  • 'voidful/albert_chinese_xlarge' is a correct model identifier listed on 'https://huggingface.co/models'
  • or 'voidful/albert_chinese_xlarge' is the correct path to a directory containing a file named one of tf_model.h5, pytorch_model.bin.

You need to add from_pt=True in order to load a PyTorch checkpoint into a TensorFlow model class.

from transformers import TFAutoModel, BertTokenizer
pretrained = './albert_chinese_tiny'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = TFAutoModel.from_pretrained(pretrained, from_pt=True)

inputs = tokenizer("我喜欢你!", return_tensors="tf")
outputs = model(**inputs)
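
For a local model directory, whether from_pt=True is needed can be decided by looking at which weights file is present. The sketch below is a rough illustration, not part of the transformers library: the helper name needs_from_pt is hypothetical, and the two file names are the ones listed in the error message above.

```python
import os
import tempfile


def needs_from_pt(model_dir):
    """Hypothetical helper: True if only a PyTorch checkpoint is available in model_dir."""
    files = set(os.listdir(model_dir))
    if "tf_model.h5" in files:
        return False  # native TensorFlow weights exist
    if "pytorch_model.bin" in files:
        return True  # only PyTorch weights, so conversion is needed
    raise FileNotFoundError("no tf_model.h5 or pytorch_model.bin in " + model_dir)


# Demo on a throwaway directory that only holds an empty PyTorch checkpoint file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "pytorch_model.bin"), "w").close()
    pt_only = needs_from_pt(demo_dir)

print(pt_only)
```

With a real local directory, the result could be passed along as TFAutoModel.from_pretrained(pretrained, from_pt=needs_from_pt(pretrained)).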