Get data timeout, key=root:110:ALLGATHER
Issue Type
Others
Source
binary
Secretflow Version
latest
OS Platform and Distribution
ubuntu 18.04
Python version
3.8.13
Bazel version
No response
GCC/Compiler version
No response
What happened and what you expected to happen.
2022-07-28 16:16:13,219 ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::SPURuntime.run() (pid=13081, ip=10.100.82.74, repr=<secretflow.device.device.spu.SPURuntime object at 0x7f1fd47b1220>)
File "/home/ops/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/spu.py", line 224, in run
self.runtime.run(executable)
File "/home/ops/anaconda3/envs/secretflow/lib/python3.8/site-packages/spu/binding/api.py", line 43, in run
return self._vm.Run(executable.SerializeToString())
RuntimeError: what:
[external/yasl/yasl/link/transport/channel.cc:86] Get data timeout, key=root:110:ALLGATHER
stacktrace:
#0 yasl::link::Context::RecvInternal()+0x7f202eb100b2
#1 yasl::link::AllGatherImpl<>()+0x7f202e9c8785
#2 yasl::link::AllGather()+0x7f202e9c8cb4
#3 spu::mpc::Communicator::allReduce()+0x7f202e2c7a37
#4 spu::mpc::semi2k::B2A_Randbit::proc()::{lambda()#1}::operator()()::{lambda()#3}::operator()()+0x7f202e2bd9f2
#5 spu::mpc::semi2k::B2A_Randbit::proc()+0x7f202e2c0a89
#6 spu::mpc::UnaryKernel::evaluate()+0x7f202e19efdb
#7 spu::mpc::Object::call<>()+0x7f202e2c60b8
#8 spu::mpc::(anonymous namespace)::_Lazy2A()+0x7f202e2dfb19
#9 spu::mpc::ABProtAddSP::proc()+0x7f202e2e019b
#10 spu::mpc::BinaryKernel::evaluate()+0x7f202e19f2f2
#11 spu::mpc::Object::call<>()+0x7f202e2c6866
#12 spu::mpc::add_sp()+0x7f202e2c6994
#13 spu::hal::_add_sp()+0x7f202e171b63
#14 spu::hal::_add()+0x7f202e167486
#15 spu::hal::_popcount()+0x7f202e168b8c
Reproduction code to reproduce the issue.
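I ran into the error above while running a three-party logistic regression. It seems to be related to the amount of training data. If the code is left unchanged, is the only option to upgrade the machine configuration or add more compute nodes?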
Issue Analytics
- State:
- Created a year ago
- Comments:16
Top GitHub Comments
Hi @mingo0117 ,
First, you need to enable the corresponding logs by setting the SPU config:
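The snippet that accompanied this step is not preserved here. As a minimal sketch, assuming the SPU `RuntimeConfig` exposes trace and profile switches named `enable_action_trace`, `enable_pphlo_trace`, `enable_pphlo_profile` and `enable_hal_profile` (names may differ between SPU versions), a cluster definition with logging turned on might look like the following. The helper `build_cluster_def`, the party names, and the addresses are hypothetical placeholders.

```python
# Sketch only: flag names are assumptions about SPU's RuntimeConfig and may
# vary by version; party names, ids, and addresses are placeholders.

def build_cluster_def(parties, base_port=9100, protocol="SEMI2K", field="FM128"):
    """Build an SPU cluster definition dict with trace/profile logging enabled."""
    nodes = [
        {"party": party, "id": f"local:{i}", "address": f"127.0.0.1:{base_port + i}"}
        for i, party in enumerate(parties)
    ]
    runtime_config = {
        # In real code these are usually spu.spu_pb2.SEMI2K / spu.spu_pb2.FM128.
        "protocol": protocol,
        "field": field,
        # Assumed logging switches: print per-op traces and timing profiles.
        "enable_action_trace": True,
        "enable_pphlo_trace": True,
        "enable_pphlo_profile": True,
        "enable_hal_profile": True,
    }
    return {"nodes": nodes, "runtime_config": runtime_config}


cluster_def = build_cluster_def(["alice", "bob", "carol"])
```

The resulting dict would then typically be handed to `sf.SPU(cluster_def)`; check the field names against the `RuntimeConfig` definition of the installed SPU version before relying on them.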
Then, you need to turn on log_to_driver when calling secretflow init, similar to
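The original example for this step is likewise missing. A minimal sketch, assuming `secretflow.init` accepts `parties`, `num_cpus` and `log_to_driver` as keyword arguments (as it did in releases from around the time of this issue); the helpers `init_kwargs` and `init_secretflow`, the party names, and the CPU count are illustrative assumptions:

```python
# Sketch only: helper names, party names, and num_cpus are placeholders.

def init_kwargs(parties, num_cpus=8):
    """Keyword arguments for secretflow.init with worker logs sent to the driver."""
    return {
        "parties": list(parties),
        "num_cpus": num_cpus,
        # Forward Ray worker output (including SPU logs) to the driver console.
        "log_to_driver": True,
    }


def init_secretflow(parties):
    # Imported lazily so init_kwargs can be inspected without secretflow installed.
    import secretflow as sf

    sf.init(**init_kwargs(parties))


kwargs = init_kwargs(["alice", "bob", "carol"])
```

Calling `init_secretflow(["alice", "bob", "carol"])` on a machine with SecretFlow installed should make the worker-side logs show up in the driver's output, which is where the SPU traces would appear.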
Understood, thank you very much.