bug in transformers notebook (training from scratch)?
Hello there!
First of all, I cannot thank @Rocketknight1 enough for the amazing work he has been doing to create TensorFlow versions of the notebooks. On my side, I have spent some time and money (Colab Pro) trying to tie the notebooks together into a full classifier built from scratch, with the following steps:
- train the tokenizer
- train the language model
- train the classification head.
Unfortunately, I ran into two issues. You can reproduce both with the full notebook code pasted below.
First issue: when I train my own tokenizer, I actually get a perplexity (225) that is much worse than in the example shown at https://github.com/huggingface/notebooks/blob/new_tf_notebooks/examples/language_modeling-tf.ipynb, which uses
model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
This is puzzling, since my tokenizer is fitted to the very data used in the original TF2 notebook!
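One thing I wondered about: since I keep the pretrained bert-base-uncased weights but swap in a 25k-token custom vocabulary, the embedding matrix may no longer line up with the new token ids. If the goal is truly training from scratch, I guess the masked language model would be built from a fresh config sized to my tokenizer rather than loaded with from_pretrained. Something like this, perhaps (just a sketch, not what the notebook below does, and I have not benchmarked it):
from transformers import BertConfig, TFBertForMaskedLM
# sketch only: a randomly initialised BERT whose vocab size matches the custom tokenizer
config = BertConfig(vocab_size=mytokenizer.vocab_size)
scratch_model = TFBertForMaskedLM(config)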
Second, there seems to be a Python issue when I try to fine-tune the language model I obtained above with a text classification head. Granted, the tokenizer and the underlying language model were trained on a different dataset (the wikitext dataset from the previous two TF2 notebooks, that is); see https://github.com/huggingface/notebooks/blob/new_tf_notebooks/examples/text_classification-tf.ipynb. However, I should at least get some valid output! Instead, the model complains about the collate function (full traceback at the bottom of this post); one alternative I considered is sketched right below.
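As an aside on the collate error, I also wondered whether an explicit TF-returning padding collator would behave better than mytokenizer.pad. This is an untested idea, reusing the encoded_dataset and tokenizer_columns names from the classification code further down:
from transformers import DataCollatorWithPadding
# untested alternative to collate_fn=mytokenizer.pad
padding_collator = DataCollatorWithPadding(tokenizer=mytokenizer, return_tensors="tf")
tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=padding_collator,
)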
Could you please have a look @sgugger @LysandreJik @Rocketknight1 when you can? I would be very happy to contribute this notebook to the Hugging Face community (although most of the credit goes to @Rocketknight1). There is great demand for building language models and downstream NLP tasks from scratch.
Thanks!!!
Code below
get the most recent versions
!pip install git+https://github.com/huggingface/datasets.git
!pip install transformers
train tokenizer from scratch
from datasets import load_dataset
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
batch_size = 1000
def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

all_texts = [dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size)]
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)
tokenizer.decoder = decoders.WordPiece(prefix="##")
from transformers import BertTokenizerFast
mytokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
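optional sanity check of the new tokenizer (the example sentence is arbitrary, just to eyeball the vocabulary size and the WordPiece splits)
print(mytokenizer.vocab_size)
print(mytokenizer("the quick brown fox jumps over the lazy dog").tokens())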
masked language model from scratch using my own tokenizer mytokenizer
model_checkpoint = "bert-base-uncased"
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
def tokenize_function(examples):
    return mytokenizer(examples["text"], truncation=True)

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
block_size = 128
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported it.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
from transformers import TFAutoModelForMaskedLM
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
from transformers import create_optimizer, AdamWeightDecay
import tensorflow as tf
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
def dummy_loss(y_true, y_pred):
    return tf.reduce_mean(y_pred)

model.compile(optimizer=optimizer, loss={"loss": dummy_loss})
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=mytokenizer, mlm_probability=0.15, return_tensors="tf"
)
train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

validation_set = lm_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
model.fit(train_set, validation_data=validation_set, epochs=1)
import math
eval_results = model.evaluate(validation_set)[0]
print(f"Perplexity: {math.exp(eval_results):.2f}")
and fine-tune a classification task
GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]
task = "sst2"
batch_size = 16
from datasets import load_dataset, load_metric
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric("glue", actual_task)
and now try to classify text
from transformers import AutoTokenizer
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

def preprocess_function(examples):
    if sentence2_key is None:
        return mytokenizer(examples[sentence1_key], truncation=True)
    return mytokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
pre_tokenizer_columns = set(dataset["train"].features)
encoded_dataset = dataset.map(preprocess_function, batched=True)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)
validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)
tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=mytokenizer.pad,
)

tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["label"],
    shuffle=False,
    batch_size=16,
    collate_fn=mytokenizer.pad,
)
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf
num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
if task == "stsb":
    loss = tf.keras.losses.MeanSquaredError()
    num_labels = 1
elif task.startswith("mnli"):
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 3
else:
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 2

model = TFAutoModelForSequenceClassification.from_pretrained(
    model, num_labels=num_labels
)
from transformers import create_optimizer
num_epochs = 5
batches_per_epoch = len(encoded_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)
model.compile(optimizer=optimizer, loss=loss)
metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)
import numpy as np

def compute_metrics(predictions, labels):
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=5,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=2)],
)
predictions = model.predict(tf_validation_dataset)["logits"]
compute_metrics(predictions, np.array(encoded_dataset[validation_key]["label"]))
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d01ad7112f932f9c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-de5efda680a1f856.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-0f3c1e00b7f03ba8.arrow
Sentence: hide new secretions from the parental units
Columns added by tokenizer: ['attention_mask', 'input_ids', 'token_type_ids']
---------------------------------------------------------------------------
VisibleDeprecationWarning Traceback (most recent call last)
<ipython-input-42-6eba4122302c> in <module>()
44 shuffle=True,
45 batch_size=16,
---> 46 collate_fn=mytokenizer.pad,
47 )
48 tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
9 frames
/usr/local/lib/python3.7/dist-packages/datasets/formatting/formatting.py in _arrow_array_to_numpy(self, pa_array)
165 # cast to list of arrays or we end up with a np.array with dtype object
166 array: List[np.ndarray] = pa_array.to_numpy(zero_copy_only=zero_copy_only).tolist()
--> 167 return np.array(array, copy=False, **self.np_array_kwargs)
168
169
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
What do you think? Happy to help if I can. Thanks!!
I'm sure @Rocketknight1 will know what's going on here 😃

Hi @randomgambit, sorry for the lengthy delay in replying again! I'm still making changes to some of the lower-level parts of the library, so these notebooks haven't been fully finalized yet.

The VisibleDeprecationWarning in your first post is something that will hopefully be fixed by upcoming changes to datasets, but for now you can just ignore it.

The error you're getting in your final post is, I think, caused by you overwriting the variable model in your code. The from_pretrained() method expects a string like bert-base-cased, but it seems like you've created an actual TF model with that variable name. If you pass an actual model object to from_pretrained() it'll get very confused - so make sure that whatever argument you're passing there is a string and not something else!
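For example, something along these lines should work (untested sketch - the directory name is just a placeholder, and it assumes you keep the masked-LM model under a distinct variable name such as mlm_model instead of reusing model):
mlm_dir = "my-mlm-from-scratch"  # placeholder local path
mlm_model.save_pretrained(mlm_dir)   # the TFAutoModelForMaskedLM trained earlier
mytokenizer.save_pretrained(mlm_dir)

from transformers import TFAutoModelForSequenceClassification

classifier = TFAutoModelForSequenceClassification.from_pretrained(
    mlm_dir, num_labels=num_labels
)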