Help needed to enable TPU Training

I’ve been trying to get TensorFlowTTS to train on Cloud TPUs, because they’re really fast and easy to access through the TRC, starting with MB-MelGAN with the HiFi-GAN discriminator. I’ve already implemented all the required changes, including dataloader overhauls to use TFRecords and Google Cloud. When I try to train, however, I get this cryptic error in both TF 2.5.0 and nightly (I didn’t use TF 2.3.1 because it wrongly allocates something to the CPU, which causes a different error).

         [[cond_1]]
         [[TPUReplicate/_compile/_10135486412832257275/_4]]
         [[TPUReplicate/_compile/_10135486412832257275/_4/_76]]
  (4) Invalid argument: {{function_node __inference__one_step_forward_179257}} Output shapes of then and else branches do not match: (f32[64,<=8192], f32[64,<=8192]) vs. (f32[64,<=8192], f32[0])

[64,<=8192] is [batch_size, batch_max_steps]. Here’s the full training log: train_log.txt

I can’t figure out what causes this issue, no matter what I try. Any ideas? Being able to train on TPUs would be really beneficial, and it feels within reach.

I can provide specific instructions to replicate the issue, but it requires Google Cloud with storage even when using a Colab TPU (TensorFlow 2.x refuses to save and load data from the local filesystem when using a TPU). The same code, including the TFRecord dataloader, trains fine on GPU.
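The "Output shapes of then and else branches do not match" message in the log can be read as a per-output shape comparison between the two branches of a conditional. The following is only a plain-Python sketch of that comparison, using the shapes printed in the error. It does not use TensorFlow, it is not how XLA implements the check, and the helper `find_mismatched_outputs` is a hypothetical name introduced for illustration.

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, ...]


def find_mismatched_outputs(
    then_shapes: Sequence[Shape], else_shapes: Sequence[Shape]
) -> List[int]:
    """Return the indices of outputs whose shapes differ between the branches."""
    if len(then_shapes) != len(else_shapes):
        raise ValueError("branches return a different number of outputs")
    return [
        index
        for index, (then_shape, else_shape) in enumerate(zip(then_shapes, else_shapes))
        if tuple(then_shape) != tuple(else_shape)
    ]


BATCH_SIZE = 64
BATCH_MAX_STEPS = 8192  # the log prints "<=8192", i.e. an upper bound

# Shapes as reported in the error: (then branch) vs. (else branch).
then_branch = [(BATCH_SIZE, BATCH_MAX_STEPS), (BATCH_SIZE, BATCH_MAX_STEPS)]
else_branch = [(BATCH_SIZE, BATCH_MAX_STEPS), (0,)]

# The second output (index 1) is the one that differs: f32[64,<=8192] vs. f32[0].
mismatched = find_mismatched_outputs(then_branch, else_branch)

# One possible direction (an assumption, not a confirmed fix for this trainer):
# make the else branch return a placeholder with the same shape as the then branch.
fixed_else_branch = list(else_branch)
for index in mismatched:
    fixed_else_branch[index] = then_branch[index]
```

Read this way, the error suggests that one branch returns an empty tensor where the other returns a full `[batch_size, batch_max_steps]` tensor. In TensorFlow, a placeholder of matching shape (for example one built with `tf.zeros_like`) might be one way to make the branches agree, but whether that applies here depends on trainer code that is not shown in this issue.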

Issue Analytics

State:
Created 2 years ago
Comments: 11

Top GitHub Comments

1 reaction

ZDisket commented, Jul 30, 2021

It seems that the people over at TensorFlowASR already have TPU support and ran into problems in the past as well, so it might be worth looking into: https://github.com/TensorSpeech/TensorFlowASR/issues/100

0 reactions

stale[bot] commented, Oct 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

