Help needed to enable TPU Training
I've been trying to get TensorFlowTTS to train on Cloud TPUs, since they're fast and easy to access through the TRC program, starting with MB-MelGAN plus the HiFi-GAN discriminator. I've already implemented all the changes this requires, including a dataloader overhaul to use TFRecords and Google Cloud Storage. When I try to train, however, I get the cryptic error below, on both TF 2.5.0 and tf-nightly (I didn't use TF 2.3.1 because it wrongly allocates an op to the CPU, causing a different error).
```
(4) Invalid argument: {{function_node __inference__one_step_forward_179257}} Output shapes of then and else branches do not match: (f32[64,<=8192], f32[64,<=8192]) vs. (f32[64,<=8192], f32[0])
	 [[cond_1]]
	 [[TPUReplicate/_compile/_10135486412832257275/_4]]
	 [[TPUReplicate/_compile/_10135486412832257275/_4/_76]]
```

Here, `[64,<=8192]` is `[batch_size, batch_max_steps]`.
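My best guess at what the message means: XLA requires both branches of a `tf.cond` to produce outputs it can unify into a single (bounded) shape, which graph execution on GPU doesn't enforce when the shapes are only known at runtime. Here's a toy sketch of the kind of construct that seems to trigger this class of mismatch. It is not the actual TensorFlowTTS code, and the function and argument names are made up:

```python
import tensorflow as tf

@tf.function
def one_step_forward(y_hat, use_subband):
    # Graph building accepts this because tf.boolean_mask has an
    # unknown static shape. Under XLA, however, the branches compile
    # to f32[<=N] vs f32[0], which cannot be unified into one output
    # shape, similar to the error above.
    return tf.cond(
        use_subband,
        lambda: tf.boolean_mask(y_hat, y_hat > 0.0),  # dynamic, bounded shape
        lambda: tf.zeros([0], dtype=tf.float32),      # statically empty
    )

# Runs fine on CPU/GPU graph execution:
print(one_step_forward(tf.random.normal([8]), tf.constant(True)))
```

If some conditional path in the loss or discriminator step returns an empty tensor, that would match the `f32[0]` in the message, but I haven't been able to pin down where that happens.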
Here’s the full training log:
train_log.txt
I can't figure out what causes this issue, no matter what I try. Any ideas? Being able to train on TPUs would be really beneficial and seems within reach. I can provide exact instructions to reproduce the issue, but it requires a Google Cloud project with a storage bucket even when using a Colab TPU (TensorFlow 2.x refuses to save and load data from the local filesystem when a TPU is in use). The same code, including the TFRecord dataloader, trains fine on GPU.
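For anyone wanting to reproduce: the pipeline follows the standard TPU pattern of reading TFRecords straight from a GCS bucket under a `TPUStrategy`. Here's a rough sketch of what my setup looks like; the bucket path, feature keys, and parse function are placeholders, not the exact code:

```python
import tensorflow as tf

# Connect to the TPU and create a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Placeholder feature spec; the real records carry more fields.
FEATURES = {
    "audio": tf.io.FixedLenFeature([], tf.string),
    "mel": tf.io.FixedLenFeature([], tf.string),
}

def parse_example(serialized):
    ex = tf.io.parse_single_example(serialized, FEATURES)
    audio = tf.io.parse_tensor(ex["audio"], out_type=tf.float32)
    mel = tf.io.parse_tensor(ex["mel"], out_type=tf.float32)
    return audio, mel

# TFRecords must live on GCS: the TPU workers cannot read the
# VM-local filesystem, which is why a bucket is required even on Colab.
files = tf.io.gfile.glob("gs://my-bucket/tfrecords/train-*.tfrecord")
dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64, drop_remainder=True)  # fixed batch size keeps shapes static
    .prefetch(tf.data.AUTOTUNE)
)
dist_dataset = strategy.experimental_distribute_dataset(dataset)
```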
It seems that the people over at TensorFlowASR already have TPU support and ran into similar problems in the past, so it might be worth looking into: https://github.com/TensorSpeech/TensorFlowASR/issues/100
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.