Help needed to enable TPU Training

I’ve been trying to get TensorflowTTS to train on Cloud TPUs because they’re really fast and easy to access through the TRC, starting with MB-MelGAN plus the HiFi-GAN discriminator. I’ve already implemented all the required changes here, including dataloader overhauls to use TFRecords and Google Cloud. When I try to train, however, I get this cryptic error in both TF 2.5.0 and nightly (I didn’t use TF 2.3.1 because it wrongly places something on the CPU, which causes a different error).

         [[cond_1]]
         [[TPUReplicate/_compile/_10135486412832257275/_4]]
         [[TPUReplicate/_compile/_10135486412832257275/_4/_76]]
  (4) Invalid argument: {{function_node __inference__one_step_forward_179257}} Output shapes of then and else branches do not match: (f32[64,<=8192], f32[64,<=8192]) vs. (f32[64,<=8192], f32[0])

[64,<=8192] is [batch_size, batch_max_steps]. Here’s the full training log: train_log.txt

I can’t figure out what causes this issue, no matter what I try. Any ideas? Being able to train on TPUs would be really beneficial, and it seems within reach.

I can provide specific instructions to replicate the issue, but doing so requires a Google Cloud account with storage, even when using a Colab TPU (TensorFlow 2.x refuses to save and load data from the local filesystem when using a TPU). The same code, including the TFRecord dataloader, trains fine on GPU.
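The error message says that the two branches of a conditional return different shapes for their second output: `f32[64,<=8192]` in one branch and `f32[0]` in the other. XLA, which compiles TPU programs, requires both branches to return matching shapes. The sketch below is not taken from the TensorflowTTS code and is not a confirmed diagnosis of this issue; it only illustrates the general pattern of keeping both `tf.cond` branches shape-compatible. The function name `forward`, the flag `use_second_output`, and the small sizes are hypothetical.

```python
import tensorflow as tf

BATCH_SIZE = 4   # hypothetical; the issue uses 64
MAX_STEPS = 16   # hypothetical; the issue uses 8192


@tf.function
def forward(audio, use_second_output):
    """Return two tensors whose shapes are identical on both branches."""

    def then_branch():
        # Both outputs have shape [batch, steps].
        return audio * 2.0, audio * 0.5

    def else_branch():
        # An empty placeholder such as tf.zeros([0]) here would give this
        # branch a different output shape, which XLA is expected to reject.
        # A zero tensor of matching shape keeps the two signatures equal.
        return audio * 2.0, tf.zeros_like(audio)

    return tf.cond(use_second_output, then_branch, else_branch)


audio = tf.ones([BATCH_SIZE, MAX_STEPS], dtype=tf.float32)
first_on, second_on = forward(audio, tf.constant(True))
first_off, second_off = forward(audio, tf.constant(False))
```

This runs on CPU or GPU as well; whether it applies to the trainer in question depends on where the empty second output actually comes from, which the log excerpt alone does not show.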

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 11

Top GitHub Comments

1 reaction
ZDisket commented, Jul 30, 2021

It seems that the people over at TensorflowASR already have TPU support and ran into problems in the past as well - might be worth looking into: https://github.com/TensorSpeech/TensorFlowASR/issues/100

0 reactions
stale[bot] commented, Oct 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Top Results From Across the Web

Using TPUs to train your model | AI Platform Training
To use a TPU with AI Platform Training, configure your training job to access a TPU-enabled machine in one of three ways: Use...

Use TPUs | TensorFlow Core
This guide demonstrates how to perform basic training on Tensor Processing Units (TPUs) and TPU Pods, a collection of TPU devices connected by...

A Comprehensive Guide to training CNNs on TPU | Nov, 2022
Obviously, I haven't covered everything you need to know to successfully train your ML models on TPU. Below is a list of some...

Step-by-Step Use of Google Colab's Free TPU - Heartbeat
A guide to using Google's TPU to train powerful deep learning models. ... You'll need to make the TPU selection on Google Colab...

Training Your Models on Cloud TPUs in 4 Easy Steps on ...
Training Your Models on Cloud TPUs in 4 Easy Steps on Google Colab · I trained an Neural Machine Translation(NMT) model on a...
