Get data timeout, key=root:110:ALLGATHER

Issue Type

Others

Source

binary

Secretflow Version

latest

OS Platform and Distribution

ubuntu 18.04

Python version

3.8.13

Bazel version

No response

GCC/Compiler version

No response

What happened and what you expected to happen.

2022-07-28 16:16:13,219 ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::SPURuntime.run() (pid=13081, ip=10.100.82.74, repr=<secretflow.device.device.spu.SPURuntime object at 0x7f1fd47b1220>)
  File "/home/ops/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/spu.py", line 224, in run
    self.runtime.run(executable)
  File "/home/ops/anaconda3/envs/secretflow/lib/python3.8/site-packages/spu/binding/api.py", line 43, in run
    return self._vm.Run(executable.SerializeToString())
RuntimeError: what: 
        [external/yasl/yasl/link/transport/channel.cc:86] Get data timeout, key=root:110:ALLGATHER
stacktrace: 
#0 yasl::link::Context::RecvInternal()+0x7f202eb100b2
#1 yasl::link::AllGatherImpl<>()+0x7f202e9c8785
#2 yasl::link::AllGather()+0x7f202e9c8cb4
#3 spu::mpc::Communicator::allReduce()+0x7f202e2c7a37
#4 spu::mpc::semi2k::B2A_Randbit::proc()::{lambda()#1}::operator()()::{lambda()#3}::operator()()+0x7f202e2bd9f2
#5 spu::mpc::semi2k::B2A_Randbit::proc()+0x7f202e2c0a89
#6 spu::mpc::UnaryKernel::evaluate()+0x7f202e19efdb
#7 spu::mpc::Object::call<>()+0x7f202e2c60b8
#8 spu::mpc::(anonymous namespace)::_Lazy2A()+0x7f202e2dfb19
#9 spu::mpc::ABProtAddSP::proc()+0x7f202e2e019b
#10 spu::mpc::BinaryKernel::evaluate()+0x7f202e19f2f2
#11 spu::mpc::Object::call<>()+0x7f202e2c6866
#12 spu::mpc::add_sp()+0x7f202e2c6994
#13 spu::hal::_add_sp()+0x7f202e171b63
#14 spu::hal::_add()+0x7f202e167486
#15 spu::hal::_popcount()+0x7f202e168b8c

Reproduction code to reproduce the issue.

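I hit the error above while running three-party logistic regression. It seems to be related to the amount of training data. If the code is left unchanged, is the only option to upgrade the machine configuration or add compute nodes?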

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 16

Top GitHub Comments

6fj commented, Aug 2, 2022

Hi @mingo0117,

First, you need to enable the corresponding logging through the SPU config:

Then you need to turn on log_to_driver when calling secretflow init, for example:

import secretflow as sf

sf.init(['alice', 'bob'], num_cpus=8, log_to_driver=True)

mingo0117 commented, Aug 2, 2022

Understood, thank you very much.
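
For readers unfamiliar with this kind of error, here is a toy sketch of how a "Get data timeout" can arise when one party waits for a peer's message longer than the receive timeout allows. It is not yasl or SPU code: `ToyChannel`, `slow_peer`, and `run_round` are hypothetical names, and the idea that the peer was simply too slow is an assumption that matches the reporter's observation about data volume, not something confirmed in this thread.

```python
import queue
import threading
import time


class ToyChannel:
    """Hypothetical stand-in for a link channel (not the yasl implementation)."""

    def __init__(self, recv_timeout_s):
        self.recv_timeout_s = recv_timeout_s
        self._inbox = queue.Queue()

    def send(self, key, value):
        self._inbox.put((key, value))

    def recv(self, key):
        # Block until a message arrives or the receive timeout expires.
        try:
            got_key, value = self._inbox.get(timeout=self.recv_timeout_s)
        except queue.Empty:
            raise RuntimeError(f"Get data timeout, key={key}") from None
        assert got_key == key
        return value


def slow_peer(channel, key, compute_seconds):
    # Assumption: the peer needs some local compute time before it can send.
    time.sleep(compute_seconds)
    channel.send(key, b"share")


def run_round(compute_seconds, recv_timeout_s):
    """Run one toy exchange; return "ok" or the timeout message."""
    channel = ToyChannel(recv_timeout_s)
    key = "root:110:ALLGATHER"
    peer = threading.Thread(target=slow_peer, args=(channel, key, compute_seconds))
    peer.start()
    try:
        channel.recv(key)
        return "ok"
    except RuntimeError as err:
        return str(err)
    finally:
        peer.join()


if __name__ == "__main__":
    print(run_round(compute_seconds=0.01, recv_timeout_s=0.5))
    print(run_round(compute_seconds=0.3, recv_timeout_s=0.05))
```

In this toy model the two levers are a faster peer or a longer receive timeout. How either would be configured in SecretFlow is not shown in this thread, so the logs suggested above are the place to confirm the actual cause.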
