TFTrainer with TPUs: Here's a suggestion on getting it to work
Environment info

- `transformers` version: 3.0.2
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0+cu101 (False)
- Tensorflow version (GPU?): 2.3.0 (False)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: Yes
Analysis and Temporary Solution
The approach in `TFTrainer` and `TFTrainingArguments` is really good, but it's not working right now on TPUs. It looks like we need to do some work on updating the trainer. There are a number of errors around this, the most common being gradient accumulation (#6479) and `Unable to parse tensor proto`. Since Julien is on vacation, here are some things I did to get it to train on Colab with TPUs. It's hacky, but it should let you train on TPUs until Julien has a fix:
- The `strategy` loading order in `TFTrainingArguments` and `TFTrainer` doesn't play well with a typical workflow (process data, create `training_args`, load the model and pass it to `TFTrainer`). The model needs to be loaded after the strategy has been initialized, and right now the strategy is being initialized inside of `TFTrainer`.
- Shuffle, batch, etc. need to be called prior to instantiating the strategy. I think this has something to do with the way the strategy is defined in `TFTrainingArguments`.
- Calling `training_args.train_batch_size` automatically calculates the number of TPU cores. Unfortunately, this initializes the strategy, so it cannot be used to calculate `total_train_batch_size` with the current strategy implementation: the strategy would initialize prematurely, before shuffle, batch, etc. are done.
- To avoid the `Unable to parse tensor proto` error, shuffle, batch, etc. need to be pulled out of `TFTrainer`; they're handled by the `TFTrainer` method `get_train_tfdataset`. With the current strategy implementation in `TFTrainingArguments`, you'll need to distribute the dataset (which initializes the strategy) after shuffle and batch, and before loading the model.
Example with GPT2
Here’s a example implementing the above changes:
```python
# Note: you'll need to build transformers from source.
# Grab a temporary version of TFTrainer with get_train_tfdataset pulled out
# (in Colab, run this shell command in a cell):
#   !git clone https://github.com/alexorona/lm_tf_trainer

import tensorflow as tf
from transformers import TFGPT2LMHeadModel, TFTrainingArguments

from lm_tf_trainer import LMTFTrainer


# Pulled out of TFTrainer
def get_train_tfdataset(train_dataset,
                        training_args,
                        train_batch_size,
                        gradient_accumulation_steps,
                        dataloader_drop_last=False,
                        seed=40):
    total_train_batch_size = train_batch_size * gradient_accumulation_steps
    num_train_examples = tf.data.experimental.cardinality(train_dataset).numpy()
    if num_train_examples < 0:
        raise ValueError("The training dataset must have an asserted cardinality")
    ds = (
        train_dataset.repeat()
        .shuffle(num_train_examples, seed=seed)
        .batch(total_train_batch_size, drop_remainder=dataloader_drop_last)
        .prefetch(tf.data.experimental.AUTOTUNE)
    )
    # Distributing the dataset is what initializes the strategy
    return training_args.strategy.experimental_distribute_dataset(ds), num_train_examples


# Get training args
training_args = TFTrainingArguments(...)  # Create a normal training_args object

# Manual settings to avoid prematurely initializing the strategy
tpu_cores = 8
train_batch_size = tpu_cores * training_args.per_device_train_batch_size

# Format a tf.data.Dataset from lists of the different kinds of inputs
input_ids = tf.convert_to_tensor(train_input_ids)  # train_input_ids is a list of lists; train_input_ids[0] is a list of input ids
attention_mask = tf.convert_to_tensor(attention_mask)  # as above
token_type_ids = tf.convert_to_tensor(token_type_ids)  # as above
labels = tf.convert_to_tensor(train_labels)  # as above
train_inputs = {'input_ids': input_ids, 'attention_mask': attention_mask, 'token_type_ids': token_type_ids}
train_dataset = tf.data.Dataset.from_tensor_slices((train_inputs, labels))

# Now call the function to shuffle, batch and initialize the strategy
train_ds, num_train_examples = get_train_tfdataset(
    train_dataset=train_dataset,
    training_args=training_args,
    train_batch_size=train_batch_size,
    gradient_accumulation_steps=training_args.gradient_accumulation_steps,
)

# Then load the model inside the strategy scope
with training_args.strategy.scope():
    model = TFGPT2LMHeadModel.from_pretrained('gpt2-medium')

# Now train it
trainer = LMTFTrainer(args=training_args,
                      model=model,
                      num_train_examples=num_train_examples,
                      total_train_batch_size=train_batch_size * training_args.gradient_accumulation_steps,
                      train_dataset=train_ds)
trainer.train()
```
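As a sanity check on the manual `tpu_cores = 8` assumption above, you can compare it against the strategy once it has been initialized (i.e. after `get_train_tfdataset` has run); `num_replicas_in_sync` is a standard `tf.distribute.Strategy` attribute. This is just an optional sketch reusing the names from the example:

```python
# Optional sanity check (reuses training_args and tpu_cores from the example above):
# once initialized, the strategy should report one replica per TPU core.
assert training_args.strategy.num_replicas_in_sync == tpu_cores, (
    f"Expected {tpu_cores} replicas, got {training_args.strategy.num_replicas_in_sync}"
)
```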
@jplu Great points, Julien. The proposal above is just a temporary work-around. From a user perspective, there really aren't any options in `get_train_tfdataset` that haven't already been declared elsewhere, so this is a routine task with no value in exposing it to the user. Therefore, it should be hidden somewhere. The question is whether that somewhere is in `TFTrainer` or in `TFTrainingArguments`. From a library management perspective, there are a lot of considerations, including how similar `TFTrainer` and `TFTrainingArguments` are to `Trainer` and `TrainingArguments` for PyTorch. You want these classes to behave as similarly as possible. With that in mind, here are the options from best to worst:

1. Change the `TFTrainingArguments` TPU initialization procedure so that `get_train_tfdataset` can be left in `TFTrainer`. The model is still likely to be initialized outside of the scope, so a fool-proof way of dealing with this is to re-initialize the model when `trainer.train()` is called, by adding something like the sketch after this list to `TFTrainer.train()`.
2. Initialize the strategy when `TFTrainingArguments` is first declared. In that case, `get_train_tfdataset` could be placed inside of `TFTrainingArguments`. We'd also need to note in the documentation that the model has to be loaded after `TFTrainingArguments`, with the clause `with training_args.strategy.scope():` coming before the line that loads the model.
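The snippet that originally accompanied option 1 did not survive formatting; here is a minimal sketch of the idea, assuming the trainer knows the model class and checkpoint to rebuild from (the helper name and its arguments are hypothetical, not part of the actual `TFTrainer` API):

```python
from transformers import TFGPT2LMHeadModel, TFTrainingArguments


def rebuild_model_in_scope(training_args, model_class, checkpoint):
    """Hypothetical helper sketching what TFTrainer.train() could do first:
    re-create the model inside the strategy scope so its variables are placed
    on the TPU replicas, even if the user built the model outside the scope."""
    with training_args.strategy.scope():
        return model_class.from_pretrained(checkpoint)


# Usage sketch (what trainer.train() could do internally before training):
# trainer.model = rebuild_model_in_scope(training_args, TFGPT2LMHeadModel, "gpt2-medium")
```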
@volker42maru I haven't had any problems with loading TF data records directly. Can you restructure so that the dataset is something like `tf.data.Dataset.from_tensor_slices((train_inputs, train_labels))`? Are you sure your batch size is at least the number of TPU cores and you're calling `strategy.experimental_distribute_dataset(dataset)` somewhere? I've been able to load and transform data just fine on Colab. You can also connect to Google Drive and use it as a disk with:
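The snippet that followed was lost in formatting; mounting Drive in a Colab notebook uses the standard `google.colab` helper:

```python
# Mount Google Drive in a Colab notebook so it can be used as a regular disk
from google.colab import drive

drive.mount('/content/gdrive')
# Files are then available under /content/gdrive/My Drive/
```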