
Questions wrt training on TPU Pod

See original GitHub issue

Hi Accelerate Team,

I’m looking to use run_mlm_no_trainer.py on a TPU v3-128 pod. I have a few questions before getting started:

  1. Can I stream the data directly from a GCS bucket, or do I have to download it to the VM where I’m training? (A streaming sketch follows this list.)
  2. Does the Accelerate library support training on a TPU pod, or is it limited to 8 cores (based on #471)? (See the Accelerate sketch after this post.)
  3. Should I be using a TPU Node or a TPU VM for better performance with the Accelerate library?
  4. Is there a notebook or blog post to help set up the environment and run small tests on a GCP VM for TPU training?
  5. I want to train on a dataset on the order of 1 TB; will HF Datasets be able to handle this on a machine with 256 GB of RAM? (Possibly a question for the datasets repo, but trying here as well.)
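
For questions 1 and 5, here is a minimal sketch (not from the thread) of how streaming with HF Datasets avoids materializing the corpus in host memory. It assumes datasets plus gcsfs are installed so that gs:// paths resolve through fsspec; the bucket path and file pattern are hypothetical placeholders.

```python
# Hedged sketch: lazily stream a large corpus from a GCS bucket with HF Datasets.
# Assumes `pip install datasets gcsfs`; bucket path and pattern are hypothetical.
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files="gs://my-bucket/corpus/*.jsonl.gz",  # hypothetical GCS path
    split="train",
    streaming=True,  # returns an IterableDataset that reads records on demand
)

# Peek at a few records without downloading or loading the full ~1 TB dataset.
for example in dataset.take(3):
    print(list(example.keys()))
```

With streaming=True the corpus is consumed record by record, so 256 GB of RAM is not a hard limit; throughput then depends mostly on network bandwidth to the bucket.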

Thanks in Advance

cc : @sgugger @muellerzr
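
On questions 2 and 3, the Accelerator API itself is device-agnostic; whether a launch can span more than one 8-core TPU host is exactly what this issue asks. Below is a minimal, hypothetical sketch of the training-loop pattern the no_trainer scripts follow; the linear model and random data are stand-ins for the real MLM setup.

```python
# Hedged sketch of the device-agnostic Accelerate training-loop pattern.
# The toy model and data are stand-ins; this says nothing about whether a
# launch can scale to a full v3-128 pod, which is the open question here.
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the TPU/GPU/CPU choice made via `accelerate config`

model = torch.nn.Linear(128, 2)                      # stand-in for the MLM model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
samples = [(torch.randn(8, 128), torch.randint(0, 2, (8,))) for _ in range(10)]
loader = torch.utils.data.DataLoader(samples, batch_size=None)

# prepare() moves everything to the right device and shards the dataloader per process.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for inputs, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)                       # replaces loss.backward()
    optimizer.step()
```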

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 35 (2 by maintainers)

Top GitHub Comments

36 reactions
muellerzr commented, Sep 23, 2022

To help us properly gauge the need for this feature: if you are actively trying to train on a TPU pod with PyTorch, could you react with a 👍 to this message? 😄

Thanks!

8 reactions
jianguoz commented, Sep 23, 2022

@Ontopic @sumanthd17 Hi there, please react to the message above with a 👍🏻 if you want to train models on more than 8 TPU cores in the future.

Read more comments on GitHub >

Top Results From Across the Web

Training on TPU Pods | Google Cloud
A TPU Pod allows you to distribute the processing load across multiple TPUs. ... The setup for training with TPU Pods is different...
Read more >
Exploring the limits of concurrency in ML Training on Google ...
This paper presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips. We discuss model parallelism...
Read more >
PyTorch/XLA master documentation
Example training scripts on TPU pod (with 10 billion parameters). To train large models that cannot fit into a single TPU, one should...
Read more >
Distributed Evolution Strategies Using TPUs for Meta-Learning ...
This paper assesses the viability of an evolutionary strategy meta-learning approach on supervised few-shot classification problems.
Read more >
trax-ml/community - Gitter
I am new to Trax so forgive the basic question. I noticed that when I use the Loop class to train my model,...
Read more >
