
Questions wrt training on TPU Pod

See original GitHub issue

Hi Accelerate Team,

I’m looking to use run_mlm_no_trainer.py on a TPU v3-128 pod. I have a few questions before getting started:

  1. Can I stream the data directly from a GCS bucket, or do I have to download it to the VM where I’m training? (A streaming sketch follows this list.)
  2. Does the Accelerate library support training on a TPU pod, or is it limited to 8 cores (based on #471)? (See the Accelerate sketch after this post.)
  3. Should I be using a TPU Node or a TPU VM for better performance with the Accelerate library?
  4. Is there a notebook or blog post to help set up the environment and run small tests on a GCP VM for TPU training?
  5. I want to train on a dataset on the order of 1 TB; will HF Datasets be able to handle this on a machine with 256 GB of RAM? (Possibly a question for the datasets repo, but trying here as well.)
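
For questions 1 and 5, here is a minimal sketch (not from the thread) of how streaming with HF Datasets avoids materializing the corpus in host memory. It assumes datasets plus gcsfs are installed so that gs:// paths resolve through fsspec; the bucket path and file pattern are hypothetical placeholders.

```python
# Hedged sketch: lazily stream a large corpus from a GCS bucket with HF Datasets.
# Assumes `pip install datasets gcsfs`; bucket path and pattern are hypothetical.
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files="gs://my-bucket/corpus/*.jsonl.gz",  # hypothetical GCS path
    split="train",
    streaming=True,  # returns an IterableDataset that reads records on demand
)

# Peek at a few records without downloading or loading the full ~1 TB dataset.
for example in dataset.take(3):
    print(list(example.keys()))
```

With streaming=True the corpus is consumed record by record, so 256 GB of RAM is not a hard limit; throughput then depends mostly on network bandwidth to the bucket.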

Thanks in Advance

cc : @sgugger @muellerzr
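
On questions 2 and 3, the Accelerator API itself is device-agnostic; whether a launch can span more than one 8-core TPU host is exactly what this issue asks. Below is a minimal, hypothetical sketch of the training-loop pattern the no_trainer scripts follow; the linear model and random data are stand-ins for the real MLM setup.

```python
# Hedged sketch of the device-agnostic Accelerate training-loop pattern.
# The toy model and data are stand-ins; this says nothing about whether a
# launch can scale to a full v3-128 pod, which is the open question here.
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the TPU/GPU/CPU choice made via `accelerate config`

model = torch.nn.Linear(128, 2)                      # stand-in for the MLM model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
samples = [(torch.randn(8, 128), torch.randint(0, 2, (8,))) for _ in range(10)]
loader = torch.utils.data.DataLoader(samples, batch_size=None)

# prepare() moves everything to the right device and shards the dataloader per process.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for inputs, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)                       # replaces loss.backward()
    optimizer.step()
```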

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 35 (2 by maintainers)

Top GitHub Comments

36 reactions
muellerzr commented, Sep 23, 2022

To help us properly gauge the need for this feature: if you are actively trying to train on a TPU pod with PyTorch, could you react with a 👍 to this message? 😄

Thanks!

8 reactions
jianguoz commented, Sep 23, 2022

@Ontopic @sumanthd17 Hi there, please react to the message above with a 👍🏻 if you want to train models on more than 8 TPU cores in the future.

Read more comments on GitHub >

Top Results From Across the Web

Training on TPU Pods | Google Cloud
A TPU Pod allows you to distribute the processing load across multiple TPUs. ... The setup for training with TPU Pods is different...
Read more >
Exploring the limits of concurrency in ML Training on Google ...
This paper presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips. We discuss model parallelism...
Read more >
PyTorch/XLA master documentation
Example training scripts on TPU pod (with 10 billion parameters). To train large models that cannot fit into a single TPU, one should...
Read more >
Distributed Evolution Strategies Using TPUs for Meta-Learning ...
This paper assesses the viability of an evolutionary strategy meta-learning approach on supervised few-shot classification problems.
Read more >
trax-ml/community - Gitter
I am new to Trax so forgive the basic question. I noticed that when I use the Loop class to train my model,...
Read more >
