Using Accelerate with a TPU Pod VM such as v3-32
Hi, thank you for the great library.
I have just installed Accelerate on a TPU VM v3-32, but when I set the number of TPU cores to 32 with `accelerate config` and run `accelerate test`, it throws an error:

```
ValueError: The number of devices must be either 1 or 8, got 32 instead
```

So it seems Accelerate does not yet support training on a TPU Pod VM. Could you please add this feature to Accelerate?
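For context, the setting in question lives in the config file that `accelerate config` writes out. The excerpt below is an illustrative reconstruction (field values are mine, not a copy of my actual file): setting `num_processes: 32` for the full Pod is what triggers the error above, while 8 (the cores visible on a single host) is accepted.

```yaml
# Illustrative excerpt of ~/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: TPU
mixed_precision: 'no'
num_processes: 8   # 32 (full v3-32 Pod) is rejected; 8 = TPU cores on one host
```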
By the way, I ran into another problem as well. If I use accelerate==0.9 with the v2-alpha TPU VM runtime, `accelerate test` runs successfully. But if I use accelerate==0.10 with v2-alpha, tpu-vm-pt-1.11, or tpu-vm-pt-1.10, `accelerate test` never finishes; it just runs forever.
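For what it's worth, the `ValueError` from the first issue looks like a hard-coded device-count check. A minimal sketch of that kind of validation (hypothetical and illustrative only, not Accelerate's or torch_xla's actual code):

```python
# Hypothetical sketch of a device-count check that would produce the
# "must be either 1 or 8" error above. Illustrative only.
SUPPORTED_TPU_DEVICE_COUNTS = (1, 8)  # a single Pod host exposes 8 cores


def validate_num_devices(num_devices: int) -> int:
    """Reject device counts that a single TPU VM host cannot provide."""
    if num_devices not in SUPPORTED_TPU_DEVICE_COUNTS:
        raise ValueError(
            f"The number of devices must be either 1 or 8, got {num_devices} instead"
        )
    return num_devices
```

Under a check like this, 32 cores (a whole v3-32 Pod) can never pass, no matter what the config says.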
And when I run

```shell
accelerate launch run_clm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path gpt2 \
    --output_dir /tmp/test-clm
```

it throws some errors (even with accelerate==0.9 on the v2-alpha runtime):
```
06/24/2022 18:10:16 - INFO - run_clm_no_trainer - ***** Running training *****
06/24/2022 18:10:16 - INFO - run_clm_no_trainer - Num examples = 2318
06/24/2022 18:10:16 - INFO - run_clm_no_trainer - Num Epochs = 3
06/24/2022 18:10:16 - INFO - run_clm_no_trainer - Instantaneous batch size per device = 8
06/24/2022 18:10:16 - INFO - run_clm_no_trainer - Total train batch size (w. parallel, distributed & accumulation) = 64
06/24/2022 18:10:16 - INFO - run_clm_no_trainer - Gradient Accumulation steps = 1
06/24/2022 18:10:16 - INFO - run_clm_no_trainer - Total optimization steps = 111
Grouping texts in chunks of 1024: 100%|██████████| 37/37 [00:02<00:00, 16.44ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 37/37 [00:02<00:00, 16.31ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 37/37 [00:02<00:00, 16.66ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 37/37 [00:02<00:00, 16.12ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 37/37 [00:02<00:00, 15.94ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 37/37 [00:02<00:00, 15.75ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 37/37 [00:02<00:00, 14.59ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 4/4 [00:00<00:00, 17.02ba/s]
Grouping texts in chunks of 1024:  50%|█████     | 2/4 [00:00<00:00, 14.53ba/s]
2022-06-24 18:10:19.812027: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-24 18:10:19.812100: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Grouping texts in chunks of 1024: 100%|██████████| 4/4 [00:00<00:00, 17.28ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 4/4 [00:00<00:00, 16.89ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 4/4 [00:00<00:00, 15.94ba/s]
Grouping texts in chunks of 1024:  50%|█████     | 2/4 [00:00<00:00, 14.34ba/s]
2022-06-24 18:10:20.217092: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-24 18:10:20.217159: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-06-24 18:10:20.223097: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-24 18:10:20.223158: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-06-24 18:10:20.231867: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-24 18:10:20.231934: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Grouping texts in chunks of 1024: 100%|██████████| 4/4 [00:00<00:00, 16.53ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 4/4 [00:00<00:00, 16.28ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 4/4 [00:00<00:00, 14.42ba/s]
2022-06-24 18:10:20.468890: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-24 18:10:20.468975: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-06-24 18:10:20.474551: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-24 18:10:20.474636: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-06-24 18:10:20.509402: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-06-24 18:10:20.509462: E tensorflow/core/framework/op_kernel.cc:1693] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
  1%|          | 1/111 [00:06<12:12, 6.66s/it]
2022-06-24 18:11:19.419635: F tensorflow/core/tpu/kernels/tpu_program_group.cc:86] Check failed: xla_tpu_programs.size() > 0 (0 vs. 0)
https://symbolize.stripped_domain/r/?trace=7f147ec0c18b,7f147ec0c20f,7f13cd4ff64f,7f13c833ec97,7f13c8333b01,7f13c835429e,7f13c8353e0b,7f13c4f6793d,7f13c98422a8,7f13ccff5580,7f13ccff7943,7f13cd4d0f71,7f13cd4d07a0,7f13cd4ba32b,7f147ebac608&map=c5ea6dcea9ec73900e238cf37efee14d75fd7749:7f13c06a5000-7f13d0013e28
*** SIGABRT received by PID 26683 (TID 28667) on cpu 14 from PID 26683; stack trace: ***
PC: @ 0x7f147ec0c18b  (unknown)  raise
    @ 0x7f120bb881e0   976  (unknown)
    @ 0x7f147ec0c210  3968  (unknown)
    @ 0x7f13cd4ff650    16  tensorflow::internal::LogMessageFatal::~LogMessageFatal()
    @ 0x7f13c833ec98   592  tensorflow::tpu::TpuProgramGroup::Initialize()
    @ 0x7f13c8333b02  1360  tensorflow::tpu::TpuCompilationCacheExternal::InitializeEntry()
    @ 0x7f13c835429f   800  tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsentHelper()
    @ 0x7f13c8353e0c   128  tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsent()
    @ 0x7f13c4f6793e   944  tensorflow::XRTCompileOp::Compute()
    @ 0x7f13c98422a9   432  tensorflow::XlaDevice::Compute()
    @ 0x7f13ccff5581  2080  tensorflow::(anonymous namespace)::ExecutorState<>::Process()
    @ 0x7f13ccff7944    48  std::_Function_handler<>::_M_invoke()
    @ 0x7f13cd4d0f72   128  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @ 0x7f13cd4d07a1    48  tensorflow::thread::EigenEnvironment::CreateThread()::{lambda()#1}::operator()()
    @ 0x7f13cd4ba32c    80  tensorflow::(anonymous namespace)::PThread::ThreadFn()
    @ 0x7f147ebac609  (unknown)  start_thread
https://symbolize.stripped_domain/r/?trace=7f147ec0c18b,7f120bb881df,7f147ec0c20f,7f13cd4ff64f,7f13c833ec97,7f13c8333b01,7f13c835429e,7f13c8353e0b,7f13c4f6793d,7f13c98422a8,7f13ccff5580,7f13ccff7943,7f13cd4d0f71,7f13cd4d07a0,7f13cd4ba32b,7f147ebac608&map=c5ea6dcea9ec73900e238cf37efee14d75fd7749:7f13c06a5000-7f13d0013e28,ca1b7ab241ee28147b3d590cadb5dc1b:7f11fee89000-7f120bebbb20
E0624 18:11:19.687595 28667 coredump_hook.cc:292] RAW: Remote crash data gathering hook invoked.
E0624 18:11:19.687634 28667 coredump_hook.cc:384] RAW: Skipping coredump since rlimit was 0 at process start.
E0624 18:11:19.687656 28667 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0624 18:11:19.687666 28667 coredump_hook.cc:447] RAW: Sending fingerprint to remote end.
E0624 18:11:19.687679 28667 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0624 18:11:19.687727 28667 coredump_hook.cc:451] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0624 18:11:19.687735 28667 coredump_hook.cc:525] RAW: Discarding core.
E0624 18:11:19.966672 28667 process_state.cc:771] RAW: Raising signal 6 with default behavior
```
Can you please tell me which TPU VM runtime version you usually use with Accelerate?
Thank you!
Issue Analytics
- State:
- Created: a year ago
- Reactions: 1
- Comments: 5
Top GitHub Comments

Thank you @muellerzr!
Here is the error I met: I used the TPU VM v2-alpha runtime, and the error above happened with both accelerate 0.9 and 0.10.

We're going to keep this issue and the linked issue below open regarding TPU Pods; see Sylvain's and my last note on it for more information about what's currently happening and where we are with it: https://github.com/huggingface/accelerate/issues/501#issuecomment-1256589109