Questions wrt training on TPU Pod
Hi Accelerate Team,
I’m looking to run run_mlm_no_trainer.py on a TPU v3-128 pod. I have a few questions before I get started:
- Can I stream the data directly from a GCP bucket, or do I have to download it to the VM where I’m training? (See the streaming sketch after this list.)
- Does the Accelerate library support training on a TPU Pod, or is it limited to 8 cores (based on #471)?
- Should I be using a TPU Node or a TPU VM for better performance with the Accelerate library?
- Is there a notebook or blog post to help set up the environment and run small tests on a GCP VM for TPU training? (See the single-VM launcher sketch after this list.)
- I want to train on a dataset on the order of 1 TB. Will HF Datasets be able to handle this on a machine with 256 GB of RAM? (Possibly a question for the datasets repo, but trying here as well.)
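
On the streaming question, below is a minimal sketch of what this could look like with HF Datasets. It is only an illustration, not code from run_mlm_no_trainer.py: the gs://my-bucket/mlm-corpus/ path and the JSON Lines layout are hypothetical, and it assumes gcsfs is installed so that fsspec can resolve gs:// URLs. Streaming also keeps memory usage flat, which is relevant to the 1 TB / 256 GB RAM question.

```python
# Minimal sketch (not from run_mlm_no_trainer.py): stream an MLM corpus
# directly from a GCS bucket with HF Datasets instead of downloading it first.
# Assumes `pip install datasets gcsfs`; the bucket path below is hypothetical.
from datasets import load_dataset

# streaming=True returns an IterableDataset: shards are read lazily over
# fsspec/gcsfs, so a ~1 TB corpus never has to fit into 256 GB of RAM.
dataset = load_dataset(
    "json",
    data_files="gs://my-bucket/mlm-corpus/*.jsonl",  # hypothetical path
    split="train",
    streaming=True,
)

# Quick sanity check: pull a few examples without materializing the dataset.
for example in dataset.take(3):
    print(example)
```

An IterableDataset can still be tokenized with .map() and shuffled with a buffer (dataset.shuffle(buffer_size=...)) before being wrapped in a DataLoader, so the rest of the script's preprocessing should carry over with small changes.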
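For the "small tests on a GCP VM" part, here is a sketch of launching on a single 8-core TPU VM with notebook_launcher. This is again just an assumed setup (the training_function body is a hypothetical stand-in for the script's training loop), and it says nothing about pod-scale (>8 cores) support, which is exactly what #471 tracks.

```python
# Minimal sketch, assuming a single TPU VM with 8 cores and torch_xla installed;
# this is for small-scale tests only, not for a v3-128 pod.
from accelerate import Accelerator, notebook_launcher

def training_function():
    # Hypothetical stand-in for the training loop of run_mlm_no_trainer.py.
    # Accelerator() detects the TPU/XLA backend in each spawned process.
    accelerator = Accelerator()
    print(f"process {accelerator.process_index} of {accelerator.num_processes}")
    # ... build the model and dataloaders, then accelerator.prepare(...) as in the script

# Spawns one process per TPU core on the VM.
notebook_launcher(training_function, num_processes=8)
```

The same function body can also be launched from the command line with `accelerate launch` once `accelerate config` has been pointed at TPU; whether that scales past 8 cores on a pod is the open question above.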
Thanks in advance!
cc: @sgugger @muellerzr
Issue Analytics
- State:
- Created a year ago
- Comments: 35 (2 by maintainers)
Top Results From Across the Web
- Training on TPU Pods | Google Cloud: A TPU Pod allows you to distribute the processing load across multiple TPUs. ... The setup for training with TPU Pods is different...
- Exploring the limits of concurrency in ML Training on Google ...: This paper presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips. We discuss model parallelism...
- PyTorch/XLA master documentation: Example training scripts on TPU pod (with 10 billion parameters). To train large models that cannot fit into a single TPU, one should...
- Distributed Evolution Strategies Using TPUs for Meta-Learning ...: This paper assesses the viability of an evolutionary strategy meta-learning approach on supervised few-shot classification problems.
- trax-ml/community - Gitter: I am new to Trax so forgive the basic question. I noticed that when I use the Loop class to train my model,...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
To help us properly gauge the need for this feature: if you are actively trying to train on a TPU Pod with PyTorch, could you react with a 👍 to this message? 😄
Thanks!
@Ontopic @sumanthd17 Hi there, please react to the message above with a 👍🏻 if you want to train models on more than 8 TPU cores in the future.