
Cannot switch back to CPU training after doing TPU training

See original GitHub issue

System Info

- `Accelerate` version: 0.10.0
- Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.22.4
- PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (False)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: TPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

First, configure Accelerate to do TPU training

$ pipenv run accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 3
What is the name of the function in your script that should be launched in all parallel scripts? [main]:
How many TPU cores should be used for distributed training? [1]:
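
(To double-check what `accelerate config` wrote before launching, the saved config file can be inspected directly. A minimal sketch follows, assuming the default config location of a standard Accelerate install; adjust the path if yours lives elsewhere.)

# Minimal check: print the saved Accelerate config.
# The path below is the assumed default location, not something from the issue.
from pathlib import Path

config_path = Path.home() / ".cache/huggingface/accelerate/default_config.yaml"
print(config_path.read_text())  # at this point it should show "distributed_type: TPU"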

Next, run the example script

$ pipenv run accelerate launch accelerate/examples/nlp_example.py
... output omitted ...

Then, configure Accelerate to do CPU training

$ pipenv run accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 1
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
How many CPU(s) should be used for distributed training? [1]:96
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: NO
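
(After re-running `accelerate config`, the same kind of check can confirm the saved config no longer requests the TPU. Again, the default path below is an assumption, and the exact distributed_type value written for the multi-CPU choice is inferred from the prompt above rather than taken from the issue.)

# Hedged sketch: verify the reconfiguration took effect before launching again.
from pathlib import Path

config_text = (Path.home() / ".cache/huggingface/accelerate/default_config.yaml").read_text()
assert "distributed_type: TPU" not in config_text, "config still requests the TPU"
print(config_text)  # expect a multi-CPU distributed_type rather than TPU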

Finally, run the example script again

$ pipenv run accelerate launch accelerate/examples/nlp_example.py
... output omitted ...

Expected behavior

Even after I specified CPU training, the last run still outputs something like the following:


Reusing dataset cifar100 (/home/qys/.cache/huggingface/datasets/cifar100/cifar100/1.0.0/f365c8b725c23e8f0f8d725c3641234d9331cd2f62919d1381d1baa5b3ba3142)
Loading cached processed dataset at /home/qys/.cache/huggingface/datasets/cifar100/cifar100/1.0.0/f365c8b725c23e8f0f8d725c3641234d9331cd2f62919d1381d1baa5b3ba3142/cache-edd23acaf2e749df.arrow
2022-06-18 14:15:25.376688: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-18 14:15:25.376774: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey

Clearly, it's still using the TPU. How can I reconfigure Accelerate to use CPUs only?

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5

Top GitHub Comments

1 reaction
muellerzr commented, Jun 21, 2022

@nalzok it is actually performing CPU training; however, upon importing torch it still warms up the TPU regardless, since it acts as a hook. That is why you see this.

To prove this, I added the following code to the training function:

def training_function(config, args):
    # Initialize accelerator
    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
    print(f'DEVICE: {accelerator.device}') # Added what we train on
    print(f'NUM_PROCESSES: {accelerator.num_processes}') # Added the number of processes

And launched it via:

accelerate launch --num_processes 1 accelerate/examples/nlp_example.py --cpu

And it printed out the correct information, confirming that it was being trained on the CPU.
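
For completeness, here is a minimal standalone sketch of the same pattern, assuming Accelerate 0.10.x: the script's own `--cpu` flag is forwarded to `Accelerator(cpu=...)`, which forces CPU regardless of what the saved config says. This mirrors what the comment above describes for `nlp_example.py`; the code is illustrative, not the exact contents of that script.

# Illustrative sketch, not the actual nlp_example.py: force CPU via the script's
# own --cpu flag, so the saved accelerate config cannot pull training onto the TPU.
import argparse
from accelerate import Accelerator

parser = argparse.ArgumentParser()
parser.add_argument("--cpu", action="store_true", help="Force training on the CPU")
args = parser.parse_args()

accelerator = Accelerator(cpu=args.cpu)
print(f"DEVICE: {accelerator.device}")                # expected: "cpu" when --cpu is passed
print(f"NUM_PROCESSES: {accelerator.num_processes}")  # expected: 1 with --num_processes 1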

0 reactions
github-actions[bot] commented, Aug 15, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


