bug in gpt2 notebook (in tensorflow)
Hello there!
I tried to use the language-modeling-from-scratch notebook https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling_from_scratch.ipynb#scrollTo=JEA1ju653l-p. More specifically, I need to run it with TensorFlow. With the simple strategy of swapping in the TF versions of the huggingface classes, everything seems to work correctly until I reach the trainer step, where I get a mysterious cardinality error.
This looks like a bug… Can you please have a look at the code below?
model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"
from transformers import AutoTokenizer
from datasets import load_dataset

datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

# Tokenizer setup, as in the original notebook.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = datasets.map(tokenize_function, batched=True, remove_columns=['text'])
block_size = 128
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead if the model
    # supported it. You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
)
print(tokenizer.decode(lm_datasets['train'][2]["input_ids"]))
from transformers import AutoConfig, TFAutoModelForCausalLM
config = AutoConfig.from_pretrained(model_checkpoint)
model = TFAutoModelForCausalLM.from_config(config)
from transformers import TFTrainer, TFTrainingArguments
training_args = TFTrainingArguments(
    "test-clm",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
)
trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets,
)
trainer.train()
Traceback (most recent call last):
File "<ipython-input-82-01e49a077e43>", line 11, in <module>
trainer.train()
File "C:\Users\john\anaconda3\envs\keras\lib\site-packages\transformers\trainer_tf.py", line 472, in train
train_ds = self.get_train_tfdataset()
File "C:\Users\john\anaconda3\envs\keras\lib\site-packages\transformers\trainer_tf.py", line 150, in get_train_tfdataset
self.num_train_examples = self.train_dataset.cardinality().numpy()
AttributeError: 'DatasetDict' object has no attribute 'cardinality'
What do you think? Thanks!
Hey! There are a couple of issues here. The first is that we’re trying to move away from TFTrainer towards Keras - there’ll be a new version of that notebook coming soon, like I promised!
In the meantime, your approach should work, though. The error you’re getting is because lm_datasets is actually a DatasetDict containing both the train and validation sets, so everything downstream gets confused. You probably want to swap out lm_datasets for lm_datasets['train'] in that call to TFTrainer.
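For clarity, that swap in your snippet would look something like this (just a sketch of the change described above, not a tested fix):

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],  # pass only the train split, not the whole DatasetDict
)
trainer.train()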
However, like I said, we’re trying to deprecate TFTrainer, so I’m trying to avoid doing any more bugfixing for it. I’m working on getting the new examples in ASAP!

The good news is I’m moving to working on those TF notebooks right now, so hopefully I’ll have a proper example to show you soon. However, the official launch of the new notebooks might depend on the PR at https://github.com/huggingface/datasets/pull/2731 being accepted and making it to release, since I’m planning to use that new method in a lot of them.
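Roughly, the Keras-style workflow I have in mind would look something like the sketch below. This isn’t the final notebook, and it assumes fairly recent versions of transformers and datasets (in particular the Dataset.to_tf_dataset method from that PR and a data collator that can return TF tensors), plus the model, tokenizer, and lm_datasets defined in your code above:

import tensorflow as tf
from transformers import DefaultDataCollator

# After group_texts every example is exactly block_size tokens, so no padding
# is needed and the default collator is enough; ask it for TF tensors.
data_collator = DefaultDataCollator(return_tensors="tf")

# Convert the train split into a tf.data.Dataset (needs Dataset.to_tf_dataset).
tf_train_set = lm_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

# Recent versions of the TF models can be compiled without an explicit loss
# and will fall back to their internal (causal LM) loss.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5))
model.fit(tf_train_set, epochs=1)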
Still, I’ll make sure to ping you as soon as I have a LM example ready - just be aware that you might have to install a pre-release version of datasets to get it to work!