
TFTrainer with TPUs: Here's a suggestion on getting it to work


Environment info

  • transformers version: 3.0.2
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0+cu101 (False)
  • Tensorflow version (GPU?): 2.3.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: Yes

Analysis and Temporary Solution

The approach in TFTrainer and TFTrainingArguments is really good, but it’s not working on TPUs right now. It looks like we need to do some work on updating the trainer. There are a number of errors around this, the most common being gradient accumulation (#6479) and Unable to parse tensor proto. Since Julien is on vacation, here are some things I did to get it to train on Colab with TPUs. It’s hacky, but it should let you use TPUs until Julien has a fix:

  • The strategy loading order in TFTrainingArguments and TFTrainer doesn’t play well with a typical workflow (process data, create training_args, load model and pass to TFTrainer). The model needs to be loaded after the strategy has been initialized, and right now the strategy is being initialized inside of TFTrainer.
  • Shuffle, batch etc. need to be called prior to instantiating the strategy. I think this has something to do with the way the strategy is defined in TFTrainingArguments.
  • Accessing training_args.train_batch_size automatically calculates the number of TPU cores. Unfortunately, this initializes the strategy, so it can’t be used to calculate total_train_batch_size with the current strategy implementation: the strategy would initialize prematurely, before shuffle, batch, etc. are done (see the toy sketch after this list).
  • To avoid the Unable to parse tensor proto error, shuffle, batch, etc. need to be pulled out of TFTrainer; they’re handled by its get_train_tfdataset method. With the current strategy implementation in TFTrainingArguments, the dataset has to be distributed after shuffle and batch, and before loading the model.
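
To make the train_batch_size point concrete, here is a toy sketch of the lazy-initialization pattern behind it. This is illustrative only, not the actual transformers source; the class and attribute names are made up, but the side effect is the same: merely reading a batch-size property forces the strategy into existence.

import tensorflow as tf

class LazyArgs:
    """Illustrative stand-in for TFTrainingArguments (not the real implementation)."""

    def __init__(self, per_device_train_batch_size=1):
        self.per_device_train_batch_size = per_device_train_batch_size
        self._strategy = None

    @property
    def strategy(self):
        # First access creates the strategy. After this point it is too late
        # to shuffle/batch the dataset "before" the strategy exists.
        if self._strategy is None:
            self._strategy = tf.distribute.get_strategy()
        return self._strategy

    @property
    def train_batch_size(self):
        # Reading this property touches `strategy`, so it initializes the
        # strategy as a side effect; that is the premature initialization
        # described in the list above.
        return self.per_device_train_batch_size * self.strategy.num_replicas_in_sync

args = LazyArgs()
print(args.train_batch_size)  # the strategy now exists, even though we only asked for a number

That is why the example below computes train_batch_size by hand from a hard-coded tpu_cores instead of reading the property.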

Example with GPT2

Here’s an example implementing the above changes:

# Note: you'll need to build transformers from source.

# Grab a temporary version of TFTrainer with get_train_tfdataset pulled out.
# Run this in a shell first: git clone https://github.com/alexorona/lm_tf_trainer
import tensorflow as tf
from transformers import TFTrainingArguments, TFGPT2LMHeadModel
from lm_tf_trainer import LMTFTrainer

# Pulled out of TFTrainer
def get_train_tfdataset(train_dataset, 
                        training_args,
                        train_batch_size, 
                        gradient_accumulation_steps, 
                        dataloader_drop_last = False, 
                        seed = 40):
    total_train_batch_size = train_batch_size * gradient_accumulation_steps
    num_train_examples = tf.data.experimental.cardinality(train_dataset).numpy()

    if num_train_examples < 0:
        raise ValueError("The training dataset must have an asserted cardinality")
    ds = (
        train_dataset.repeat()
        .shuffle(num_train_examples, seed=seed)
        .batch(total_train_batch_size, drop_remainder=dataloader_drop_last)
        .prefetch(tf.data.experimental.AUTOTUNE)
    )

    return training_args.strategy.experimental_distribute_dataset(ds), num_train_examples

# Get Training Args
training_args = TFTrainingArguments(...) # Create a normal training_args object

# Manual settings to avoid prematurely initializing the strategy
tpu_cores = 8
train_batch_size = tpu_cores * training_args.per_device_train_batch_size

# Formatting tf dataset from lists of different kinds of inputs
# train_input_ids is a list of lists (train_input_ids[0] is one example's input ids);
# the attention mask, token type id and label lists have the same shape
input_ids = tf.convert_to_tensor(train_input_ids)
attention_mask = tf.convert_to_tensor(train_attention_mask)
token_type_ids = tf.convert_to_tensor(train_token_type_ids)
train_labels = tf.convert_to_tensor(train_labels)
train_inputs = {'input_ids': input_ids, 'attention_mask': attention_mask, 'token_type_ids': token_type_ids}
train_dataset = tf.data.Dataset.from_tensor_slices((train_inputs, train_labels))

# Now, call the function to do shuffle, batch and initialize the strategy
train_ds, num_train_examples = get_train_tfdataset(train_dataset = train_dataset,
                                                   training_args = training_args,
                                                   train_batch_size = train_batch_size,
                                                   gradient_accumulation_steps = training_args.gradient_accumulation_steps)

# Then, load the model with the strategy
with training_args.strategy.scope():
    model = TFGPT2LMHeadModel.from_pretrained('gpt2-medium')

# Now, train it
trainer = LMTFTrainer(args = training_args,
                      model = model,
                      num_train_examples = num_train_examples,
                      total_train_batch_size = train_batch_size * training_args.gradient_accumulation_steps,  # matches the batching in get_train_tfdataset
                      train_dataset = train_ds)

trainer.train()
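
As a quick sanity check (assuming the Colab TPU runtime is attached), you can confirm the strategy actually picked up the TPU cores once the dataset has been distributed, since the strategy exists by that point anyway:

# Safe to read now: the strategy was initialized when the dataset was distributed.
print("Logical TPU devices:", tf.config.list_logical_devices("TPU"))
print("Replicas in sync:", training_args.strategy.num_replicas_in_sync)
assert training_args.strategy.num_replicas_in_sync == tpu_cores  # 8 on a standard Colab TPU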


Top GitHub Comments

1 reaction
alexorona commented, Aug 29, 2020

@jplu Great points, Julien. The proposal above is just a temporary work-around. From a user perspective, there really aren’t any options in get_train_tfdataset that haven’t already been declared elsewhere, so this is a routine task with no value in exposing it to the user. Therefore, it should be hidden somewhere. The question is whether that somewhere is in TFTrainer or in TFTrainingArguments. From a library management perspective, there are a lot of considerations, including how similar TFTrainer and TFTrainingArguments are to Trainer and TrainingArguments for PyTorch. You want these classes to behave as similarly as possible. With that in mind, here are the options from best to worst:

  1. See if there’s a way to modify the current TFTrainingArguments TPU initialization procedure so that get_train_tfdataset can be left in TFTrainer. The model is still likely to be initialized outside of the scope, so a fool-proof way of dealing with this is to re-initialize the model when trainer.train() is called by adding something like this in trainer.train():

with args.strategy.scope():
    self.model = self.model

  2. Barring that, it might be possible to initialize the strategy when TFTrainingArguments is first declared. In that case, get_train_tfdataset could be placed inside TFTrainingArguments. We’d also need to state in the documentation that the model has to be loaded after TFTrainingArguments, and that the with training_args.strategy.scope(): clause has to come before the line that loads the model.

@volker42maru I haven’t had any problems with loading TF data records directly. Can you restructure so that the dataset is something like tf.data.Dataset.from_tensor_slices((train_inputs, train_labels))? Are you sure your batch size is at least equal to the number of TPU cores, and that you’re calling strategy.experimental_distribute_dataset(dataset) somewhere? I’ve been able to load and transform data just fine on Colab. You can also connect to Google Drive and use it as a disk with:

from google.colab import drive
drive.mount('/content/drive')
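
If your data is in TFRecord files (for example on that mounted drive), here is a rough sketch of how they could be parsed into the same (inputs, labels) structure used above. The feature names, sequence length and record count are assumptions about how the records were written, not anything prescribed by transformers:

import tensorflow as tf

max_length = 128      # assumption: fixed sequence length used when the records were written
num_examples = 1000   # assumption: you know how many records the file contains

feature_spec = {
    "input_ids": tf.io.FixedLenFeature([max_length], tf.int64),
    "attention_mask": tf.io.FixedLenFeature([max_length], tf.int64),
    "labels": tf.io.FixedLenFeature([max_length], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    labels = parsed.pop("labels")
    return parsed, labels  # same (inputs, labels) shape as from_tensor_slices above

train_dataset = (
    tf.data.TFRecordDataset("train.tfrecord")
    .map(parse_example)
    # get_train_tfdataset above raises if the cardinality isn't asserted,
    # because TFRecordDataset has unknown cardinality by default.
    .apply(tf.data.experimental.assert_cardinality(num_examples))
)
# From here, pass train_dataset to get_train_tfdataset as in the example above,
# so batching and distribution still happen before the model is loaded.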
0 reactions
stale[bot] commented, Nov 5, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
