bug in transformers notebook (training from scratch)?
Hello there!
First of all, I cannot thank @Rocketknight1 enough for the amazing work he has been doing to create TensorFlow versions of the notebooks. On my side, I have spent some time and money (Colab Pro) trying to tie the notebooks together into a full classifier built from scratch, with the following steps:
- train the tokenizer
- train the language model
- train the classification head.
Unfortunately, I ran into two issues. You can reproduce both with the full notebook code pasted below.
First issue: when I train my own tokenizer, I actually get a perplexity (225) that is much worse than in the example shown at https://github.com/huggingface/notebooks/blob/new_tf_notebooks/examples/language_modeling-tf.ipynb, which uses
model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
This is puzzling, since my tokenizer is fitted to the very data used in the original TF2 notebook!
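One thing I wondered about: since I keep the pretrained bert-base-uncased weights but swap in a 25k-token custom vocabulary, the embedding matrix may no longer line up with the new token ids. If the goal is truly training from scratch, I guess the masked language model would be built from a fresh config sized to my tokenizer rather than loaded with from_pretrained. Something like this, perhaps (just a sketch, not what the notebook below does, and I have not benchmarked it):
from transformers import BertConfig, TFBertForMaskedLM
# sketch only: a randomly initialised BERT whose vocab size matches the custom tokenizer
config = BertConfig(vocab_size=mytokenizer.vocab_size)
scratch_model = TFBertForMaskedLM(config)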
Second, there seems to be a Python issue when I try to fine-tune the language model I obtained above with a text classification head. Granted, the tokenizer and the underlying language model were trained on a different dataset (the wikitext dataset from the previous two TF2 notebooks, that is); see https://github.com/huggingface/notebooks/blob/new_tf_notebooks/examples/text_classification-tf.ipynb. However, I should at least get some valid output! Instead, the model complains about the collate function (full traceback at the bottom of this post); one alternative I considered is sketched right below.
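As an aside on the collate error, I also wondered whether an explicit TF-returning padding collator would behave better than mytokenizer.pad. This is an untested idea, reusing the encoded_dataset and tokenizer_columns names from the classification code further down:
from transformers import DataCollatorWithPadding
# untested alternative to collate_fn=mytokenizer.pad
padding_collator = DataCollatorWithPadding(tokenizer=mytokenizer, return_tensors="tf")
tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=padding_collator,
)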
Could you please have a look @sgugger @LysandreJik @Rocketknight1 when you can? I would be very happy to contribute this notebook to the Hugging Face community (although most of the credit goes to @Rocketknight1). There is great demand for building language models and downstream NLP tasks from scratch.
Thanks!!!
Code below
get the most recent versions
!pip install git+https://github.com/huggingface/datasets.git
!pip install transformers
train tokenizer from scratch
from datasets import load_dataset
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
batch_size = 1000
def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

all_texts = [dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size)]
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)
tokenizer.decoder = decoders.WordPiece(prefix="##")
from transformers import BertTokenizerFast
mytokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
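optional sanity check of the new tokenizer (the example sentence is arbitrary, just to eyeball the vocabulary size and the WordPiece splits)
print(mytokenizer.vocab_size)
print(mytokenizer("the quick brown fox jumps over the lazy dog").tokens())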
masked language model from scratch using my own tokenizer mytokenizer
model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
def tokenize_function(examples):
    return mytokenizer(examples["text"], truncation=True)

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
block_size = 128
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported it.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
from transformers import create_optimizer, AdamWeightDecay
import tensorflow as tf
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
def dummy_loss(y_true, y_pred):
    return tf.reduce_mean(y_pred)

model.compile(optimizer=optimizer, loss={"loss": dummy_loss})
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=mytokenizer, mlm_probability=0.15, return_tensors="tf"
)
train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

validation_set = lm_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
model.fit(train_set, validation_data=validation_set, epochs=1)
import math
eval_results = model.evaluate(validation_set)[0]
print(f"Perplexity: {math.exp(eval_results):.2f}")
and fine-tune a classification task
GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]
task = "sst2"
batch_size = 16
from datasets import load_dataset, load_metric
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)
and now try to classify text
from transformers import AutoTokenizer
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

def preprocess_function(examples):
    if sentence2_key is None:
        return mytokenizer(examples[sentence1_key], truncation=True)
    return mytokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
pre_tokenizer_columns = set(dataset["train"].features)
encoded_dataset = dataset.map(preprocess_function, batched=True)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)
validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)
tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=mytokenizer.pad,
)

tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=False,
    batch_size=16,
    collate_fn=mytokenizer.pad,
)
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf
num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
if task == "stsb":
    loss = tf.keras.losses.MeanSquaredError()
    num_labels = 1
elif task.startswith("mnli"):
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 3
else:
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 2

model = TFAutoModelForSequenceClassification.from_pretrained(
    model, num_labels=num_labels
)
from transformers import create_optimizer
num_epochs = 5
batches_per_epoch = len(encoded_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)
model.compile(optimizer=optimizer, loss=loss)
metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)
import numpy as np

def compute_metrics(predictions, labels):
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=5,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=2)],
)
predictions = model.predict(tf_validation_dataset)["logits"]
compute_metrics(predictions, np.array(encoded_dataset[validation_key]["label"]))
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d01ad7112f932f9c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-de5efda680a1f856.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-0f3c1e00b7f03ba8.arrow
Sentence: hide new secretions from the parental units
Columns added by tokenizer: ['attention_mask', 'input_ids', 'token_type_ids']
---------------------------------------------------------------------------
VisibleDeprecationWarning Traceback (most recent call last)
<ipython-input-42-6eba4122302c> in <module>()
44 shuffle=True,
45 batch_size=16,
---> 46 collate_fn=mytokenizer.pad,
47 )
48 tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
9 frames
/usr/local/lib/python3.7/dist-packages/datasets/formatting/formatting.py in _arrow_array_to_numpy(self, pa_array)
165 # cast to list of arrays or we end up with a np.array with dtype object
166 array: List[np.ndarray] = pa_array.to_numpy(zero_copy_only=zero_copy_only).tolist()
--> 167 return np.array(array, copy=False, **self.np_array_kwargs)
168
169
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
What do you think? Happy to help if I can. Thanks!!
I'm sure @Rocketknight1 will know what's going on here 😃

Hi @randomgambit, sorry for the lengthy delay in replying again! I'm still making changes to some of the lower-level parts of the library, so these notebooks haven't been fully finalized yet.

The VisibleDeprecationWarning in your first post is something that will hopefully be fixed by upcoming changes to datasets, but for now you can just ignore it.

The error you're getting in your final post is, I think, caused by you overwriting the variable model in your code. The from_pretrained() method expects a string like bert-base-cased, but it seems like you've created an actual TF model with that variable name. If you pass an actual model object to from_pretrained() it'll get very confused - so make sure that whatever argument you're passing there is a string and not something else!
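For example, something along these lines should work (untested sketch - the directory name is just a placeholder, and it assumes you keep the masked-LM model under a distinct variable name such as mlm_model instead of reusing model):
mlm_dir = "my-mlm-from-scratch"  # placeholder local path
mlm_model.save_pretrained(mlm_dir)   # the TFAutoModelForMaskedLM trained earlier
mytokenizer.save_pretrained(mlm_dir)

from transformers import TFAutoModelForSequenceClassification

classifier = TFAutoModelForSequenceClassification.from_pretrained(
    mlm_dir, num_labels=num_labels
)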