
Cannot initialize a tokenizer from a third-party ALBERT model


🐛 Bug

Information

Model I am using: ALBERT

Language I am using the model on (English, Chinese …):

The problem arises when using:

[ *] the official example scripts: (give details below)

Following the instructions at https://huggingface.co/models, for example with the "voidful/albert_chinese_tiny" model, the call

AutoTokenizer.from_pretrained('voidful/albert_chinese_tiny')

raises:

Model name 'voidful/albert_chinese_tiny' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed 'voidful/albert_chinese_tiny' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.


Top GitHub Comments


voidful commented, Apr 4, 2020

The albert_chinese models do not use SentencePiece, so you have to load the tokenizer with BertTokenizer instead of AlbertTokenizer. We can check that this works with a masked language modeling (MaskedLM) example.

colab trial

import torch
from torch.nn.functional import softmax
from transformers import AlbertForMaskedLM, BertTokenizer

pretrained = 'voidful/albert_chinese_large'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)

inputtext = "今天[MASK]情很好"

# Encode once and locate [MASK] by its token id instead of hard-coding 103
token_ids = tokenizer.encode(inputtext, add_special_tokens=True)
maskpos = token_ids.index(tokenizer.mask_token_id)

input_ids = torch.tensor(token_ids).unsqueeze(0)  # Batch size 1
with torch.no_grad():
    # Without labels, the first output is the prediction scores (logits)
    prediction_scores = model(input_ids)[0]

# Probability distribution over the vocabulary at the masked position
probs = softmax(prediction_scores[0, maskpos], dim=-1)
predicted_index = torch.argmax(probs).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token, probs[predicted_index].item())

Result: 心 0.9422469735145569
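
The choice of tokenizer class follows from which vocabulary file the checkpoint ships. The error in the question shows AlbertTokenizer looking for spiece.model, whereas BertTokenizer reads a WordPiece vocab.txt. A minimal sketch of that decision is below. Note that guess_tokenizer_class is a hypothetical helper, not part of transformers, and the file lists are illustrative assumptions rather than actual repository listings.

```python
# Hypothetical helper, not part of the transformers library.
# Maps the vocabulary file found in a checkpoint to the tokenizer class that reads it.
VOCAB_FILE_TO_TOKENIZER = {
    "spiece.model": "AlbertTokenizer",  # SentencePiece model file
    "vocab.txt": "BertTokenizer",       # WordPiece vocabulary file
}


def guess_tokenizer_class(filenames):
    """Return the name of a tokenizer class matching the files, or None."""
    for vocab_file, tokenizer_name in VOCAB_FILE_TO_TOKENIZER.items():
        if vocab_file in filenames:
            return tokenizer_name
    return None


# Illustrative file lists (assumed for this sketch, not fetched from the Hub).
chinese_albert_files = ["config.json", "pytorch_model.bin", "vocab.txt"]
official_albert_files = ["config.json", "pytorch_model.bin", "spiece.model"]

print(guess_tokenizer_class(chinese_albert_files))   # BertTokenizer
print(guess_tokenizer_class(official_albert_files))  # AlbertTokenizer
```

A checkpoint that ships neither file returns None here, which is a hint to inspect the repository by hand rather than rely on AutoTokenizer.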


voidful commented, Aug 3, 2021

I have tried this code:

from transformers import TFAutoModel, BertTokenizer
pretrained = 'voidful/albert_chinese_xlarge'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = TFAutoModel.from_pretrained(pretrained)

inputs = tokenizer("我喜欢你!", return_tensors="tf")
outputs = model(**inputs)

print(outputs)

It fails with:

OSError: Can't load weights for 'voidful/albert_chinese_xlarge'. Make sure that:

- 'voidful/albert_chinese_xlarge' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'voidful/albert_chinese_xlarge' is the correct path to a directory containing a file named one of tf_model.h5, pytorch_model.bin.

You need to pass from_pt=True to from_pretrained in order to load a PyTorch checkpoint into a TensorFlow model:

from transformers import TFAutoModel, BertTokenizer
pretrained = './albert_chinese_tiny'
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = TFAutoModel.from_pretrained(pretrained, from_pt=True)

inputs = tokenizer("我喜欢你!", return_tensors="tf")
outputs = model(**inputs)
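
The error message above names the two weight files that from_pretrained looks for: tf_model.h5 and pytorch_model.bin. For a local checkpoint directory, a rough sketch of deciding whether from_pt=True is needed could look like the following. Here needs_from_pt is a hypothetical helper, not a transformers function, and the temporary directory only stands in for a real checkpoint.

```python
import os
import tempfile


def needs_from_pt(model_dir):
    """Return True when the directory holds only PyTorch weights.

    Hypothetical helper; the file names come from the OSError text above.
    """
    files = set(os.listdir(model_dir))
    has_tf = "tf_model.h5" in files
    has_pt = "pytorch_model.bin" in files
    if not has_tf and not has_pt:
        raise FileNotFoundError(
            f"No tf_model.h5 or pytorch_model.bin found in {model_dir}"
        )
    return has_pt and not has_tf


# Demo with a stand-in directory that contains only an empty PyTorch weight file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "pytorch_model.bin"), "w").close()
    print(needs_from_pt(demo_dir))  # True

# Intended use (not executed here because it needs transformers and real weights):
# model = TFAutoModel.from_pretrained(model_dir, from_pt=needs_from_pt(model_dir))
```

This only covers local directories. For a Hub identifier such as 'voidful/albert_chinese_xlarge', the simplest route is still to pass from_pt=True when the repository is known to ship PyTorch weights only.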
