Failed to update state for actor (reporter.err)
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): binary
- Ray version: 0.6.6 and latest nightly builds
- Python version: 3.6
- Exact command to reproduce: n/a
Describe the problem
We’re seeing an issue where running Tune on a GCP cluster consistently fails. I don’t know what exactly causes the issue, but there are different errors in several log files (see below). We’re able to reproduce this, but it takes several hours before the error happens.
Source code / logs
monitor.err:
2019-04-30 17:11:03,350 INFO autoscaler.py:630 -- StandardAutoscaler: 6/7 target nodes (0 pending) (2 updating)
2019-04-30 17:11:03,351 INFO autoscaler.py:631 -- LoadMetrics: MostDelayedHeartbeats={'10.142.0.36': 63.61771631240845, '10.142.0.3': 27.66781783103943, '10.142.0.16': 27.667560577392578, '10.142.0.42': 27.667370080947876, '10.142.0.28': 27.667174577713013}, NodeIdleSeconds=Min=27 Mean=32 Max=63, NumNodesConnected=7, NumNodesUsed=7.0, ResourceUsage=64.0/64.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=27 Mean=32 Max=63
2019-04-30 17:11:03,770 INFO log_timer.py:21 -- NodeUpdater: ray-ant-v3-ant-sac-bug-worker-504429d7: Got IP [LogTimer=603ms]
2019-04-30 17:11:03,770 INFO log_timer.py:21 -- NodeUpdater: ray-ant-v3-ant-sac-bug-worker-504429d7: Applied config 7c810170766ae4f575002fd7a48a85f297578545 [LogTimer=32872ms]
2019-04-30 17:11:03,770 ERROR updater.py:140 -- NodeUpdater: ray-ant-v3-ant-sac-bug-worker-504429d7: Error updating Unable to find IP of node
2019-04-30 17:11:04,685 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1556644264167-587c27c684771-e4dff7ab-3f1c68b0 to finish...
2019-04-30 17:11:10,110 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1556644264167-587c27c684771-e4dff7ab-3f1c68b0 finished.
2019-04-30 17:11:10,111 INFO autoscaler.py:179 -- LoadMetrics: Removed mapping: 10.142.0.36 - 1556644199.7331412
2019-04-30 17:11:10,111 INFO autoscaler.py:185 -- LoadMetrics: Removed 1 stale ip mappings: {'10.142.0.36'} not in {'10.142.0.3', '10.142.0.42', '10.142.0.51', '10.142.0.46', '10.142.0.16', '10.142.0.40', '10.142.0.28'}
2019-04-30 17:11:10,112 INFO autoscaler.py:179 -- LoadMetrics: Removed mapping: 10.142.0.36 - {b'GPU': 0.0, b'CPU': 8.0}
2019-04-30 17:11:10,112 INFO autoscaler.py:185 -- LoadMetrics: Removed 1 stale ip mappings: {'10.142.0.36'} not in {'10.142.0.3', '10.142.0.42', '10.142.0.51', '10.142.0.46', '10.142.0.16', '10.142.0.40', '10.142.0.28'}
2019-04-30 17:11:10,112 INFO autoscaler.py:179 -- LoadMetrics: Removed mapping: 10.142.0.36 - {b'GPU': 0.0, b'CPU': 0.0}
2019-04-30 17:11:10,112 INFO autoscaler.py:185 -- LoadMetrics: Removed 1 stale ip mappings: {'10.142.0.36'} not in {'10.142.0.3', '10.142.0.42', '10.142.0.51', '10.142.0.46', '10.142.0.16', '10.142.0.40', '10.142.0.28'}
2019-04-30 17:11:10,112 INFO autoscaler.py:179 -- LoadMetrics: Removed mapping: 10.142.0.36 - 1556644199.7331412
2019-04-30 17:11:10,112 INFO autoscaler.py:185 -- LoadMetrics: Removed 1 stale ip mappings: {'10.142.0.36'} not in {'10.142.0.3', '10.142.0.42', '10.142.0.51', '10.142.0.46', '10.142.0.16', '10.142.0.40', '10.142.0.28'}
2019-04-30 17:11:10,113 INFO autoscaler.py:456 -- Ending bringup phase
Exception in thread Thread-17:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 143, in run
raise e
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 132, in run
self.do_update()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 185, in do_update
self.set_ssh_ip_if_required()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 107, in set_ssh_ip_if_required
assert ip is not None, "Unable to find IP of node"
AssertionError: Unable to find IP of node
2019-04-30 17:11:10,837 INFO autoscaler.py:630 -- StandardAutoscaler: 6/6 target nodes (0 pending) (2 updating)
2019-04-30 17:11:10,837 INFO autoscaler.py:631 -- LoadMetrics: MostDelayedHeartbeats={'10.142.0.46': 0.3470437526702881, '10.142.0.3': 0.3468945026397705, '10.142.0.16': 0.34674549102783203, '10.142.0.42': 0.34657835960388184, '10.142.0.28': 0.3464210033416748}, NodeIdleSeconds=Min=0 Mean=0 Max=0, NumNodesConnected=6, NumNodesUsed=6.0, ResourceUsage=56.0/56.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-04-30 17:11:10,838 INFO autoscaler.py:456 -- Ending bringup phase
2019-04-30 17:11:11,523 INFO autoscaler.py:167 -- Node 10.142.0.36 is newly setup, treating as active
2019-04-30 17:11:11,658 INFO log_timer.py:21 -- NodeUpdater: ray-ant-v3-ant-sac-bug-worker-cdc88a67: Got SSH [LogTimer=41155ms]
2019-04-30 17:11:11,699 INFO autoscaler.py:630 -- StandardAutoscaler: 6/6 target nodes (0 pending) (1 updating) (1 failed to update)
2019-04-30 17:11:11,704 INFO autoscaler.py:631 -- LoadMetrics: MostDelayedHeartbeats={'10.142.0.46': 1.21382474899292, '10.142.0.3': 1.2136754989624023, '10.142.0.16': 1.2135264873504639, '10.142.0.42': 1.2133593559265137, '10.142.0.28': 1.2132019996643066}, NodeIdleSeconds=Min=1 Mean=1 Max=1, NumNodesConnected=6, NumNodesUsed=6.0, ResourceUsage=56.0/56.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=1 Max=1
2019-04-30 17:11:12,493 INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1556644271920-587c27cde96b7-dd17a5f9-9bdf1726 to finish...
2019-04-30 17:11:17,934 INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1556644271920-587c27cde96b7-dd17a5f9-9bdf1726 finished.
raylet.err:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0430 16:54:09.289755 2633 stats.h:46] Succeeded to initialize stats: exporter address is 127.0.0.1:8888
I0430 16:57:09.048403 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client d164827a07e17f244d6771e86a2f44aa09421d43 at 10.142.0.3:40029
I0430 16:59:36.291008 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client c566126203c11633ebd1af8e1e7d95be3e5a02f9 at 10.142.0.28:37979
I0430 17:02:10.545897 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 3c85e540410669fad939db340d26c40a57f179c3 at 10.142.0.36:35389
I0430 17:04:44.713376 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client a5742bd824cce90db8294b0101fe470669aa109f at 10.142.0.40:40169
I0430 17:07:18.560586 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 240c249f33d3cdce91ed7221395d6927da049ede at 10.142.0.42:43871
I0430 17:09:52.979279 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client e5659e0f1d8367a2dd194371f79bb621ce210731 at 10.142.0.46:40161
I0430 17:10:25.645897 2633 node_manager.cc:1965] Resubmitting task 000000004d20e24468e1815bc6ed1f5440886956 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 17:10:26.816884 2633 node_manager.cc:425] Actor b201a20b0e1087fddf4048d695c70271130a0f1e is disconnected, because its node 3c85e540410669fad939db340d26c40a57f179c3 is removed from cluster. It may be reconstructed.
I0430 17:10:48.382294 2633 node_manager.cc:1965] Resubmitting task 000000004d20e24468e1815bc6ed1f5440886956 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 17:12:34.533567 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client cc7a5cca0df636f5bcb8e0b7b53785962e440bd4 at 10.142.0.51:37009
I0430 17:14:47.049379 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client f8348458b8e8595c88dc7d9418e044b80290280f at 10.142.0.55:40265
I0430 17:17:15.829519 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 139fa67024fb7c2134fe15fc25196030aa7d230c at 10.142.0.61:36279
I0430 17:19:53.450584 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client b736e221542bbcba3d1b65938a8eb4846b69dd29 at 10.142.0.63:39273
I0430 17:22:12.388083 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 3189694de8db7d099861cdc2b7aa389b96ae3109 at 10.142.0.74:41385
I0430 17:24:32.472777 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client e746879fe262cf1113912d7a9a7527eb28b6427e at 10.142.0.75:32853
I0430 17:27:00.124671 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 21c6fb5ededd271c45f22693ab1b44fdc6ca3189 at 10.142.0.80:39827
I0430 17:29:29.049448 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 82594d4b25af48a6b3ddd9b6597f6ba710ed62ea at 10.142.0.82:44759
I0430 17:31:45.116662 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 5b3c03d4d90eff2ef4ea0ae72d4292a519463d36 at 10.142.0.85:41483
I0430 17:33:52.432299 2633 node_manager.cc:425] Actor 1252fdd58b6f21f8d658bf4ce8731a7fe7516a65 is disconnected, because its node f8348458b8e8595c88dc7d9418e044b80290280f is removed from cluster. It may be reconstructed.
I0430 17:34:14.596946 2633 node_manager.cc:1965] Resubmitting task 00000000b4fbe5e3cd96a79230ed2a54a724ae4b on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 17:34:52.438154 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 7bd4660c900b669a6e077f9e340d8aaf0d936b24 at 10.142.0.87:40243
I0430 17:37:15.154327 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 080c951cd89bd277331d9852b9c4fec70507a52b at 10.142.0.90:40515
I0430 17:39:48.409904 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 886560b0fc49c1eb673d30fc6dbae7c5d763747b at 10.142.0.92:36433
I0430 17:42:14.001647 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client e98379d99a4fcd2d6c82cba1943c2bd9d88a1f49 at 10.142.0.96:43073
I0430 17:44:39.168882 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client c664d8445b69276dc2ad696d8686985cee4b0546 at 10.142.0.101:39155
I0430 17:46:58.987653 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 0d8ac10d31f93f5ae4bc152adc237baf5358df51 at 10.142.0.110:37941
I0430 17:49:13.151583 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client afe63942c07d18abfaf9722d316ced4a2ec78737 at 10.142.0.116:44719
I0430 17:51:42.227932 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 530288c8f0d562a41d974427ff93be1c9dfc0cb2 at 10.142.0.117:44113
I0430 17:54:20.996582 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 9e415e5842c5c1e3560aca4e5420cc2691b84519 at 10.142.0.118:41181
I0430 17:56:38.772028 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 397e27911c8028dc8095b0560918e414a9878e18 at 10.142.0.119:38469
I0430 17:58:53.488054 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 8134e1601a785656122ff8cba83c8fef18165bf5 at 10.142.0.120:42965
I0430 18:01:32.311524 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 34bb77f34cf50f43f4b226f8b07da52f7944d91a at 10.142.0.121:34117
I0430 18:03:57.850646 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client ef72beabb673fe5d2c9b41c213214c16877ff4eb at 10.142.0.123:39995
I0430 18:06:21.184984 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 93bbbb9e3b45159073fc0ebba0a67cf176a7dd2e at 10.142.0.125:37125
I0430 18:08:41.359419 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 5cbbab19db018bf7d8a96cce2243f541a11f85c7 at 10.142.0.126:39733
I0430 18:11:06.335155 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client b7af46513089fc06659391ab4d4b63e8c0805561 at 10.142.0.127:37783
I0430 18:13:20.431676 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client e8408fcf1d37e4c149c893967d6bf406e231f044 at 10.142.0.128:44677
I0430 18:15:40.644949 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client c1507fe4e3b7ceb6faa0d76d34a0ef58caff782c at 10.142.0.129:44901
I0430 18:16:21.925865 2633 node_manager.cc:425] Actor 4d32c56dc3fd63be6875b7e164910db0e8dd0455 is disconnected, because its node 34bb77f34cf50f43f4b226f8b07da52f7944d91a is removed from cluster. It may be reconstructed.
I0430 18:17:52.238713 2633 node_manager.cc:1965] Resubmitting task 00000000c6928edb6bd8098177bd36816582c180 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 18:18:38.199642 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 7faa18abf816c36fbbfdb99e51542710ee40fdd2 at 10.142.0.131:42767
I0430 18:18:50.812274 2633 node_manager.cc:425] Actor 7b43a0c5a4ea6c5e6e57e5c9889f94776475f10e is disconnected, because its node 9e415e5842c5c1e3560aca4e5420cc2691b84519 is removed from cluster. It may be reconstructed.
I0430 18:19:47.780380 2633 node_manager.cc:1965] Resubmitting task 0000000014313af9589165bcf888fbf05148d304 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 18:20:54.495836 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client ab0cdf6c324ff6c55f8d1556c31c105a3efa6d9f at 10.142.0.132:34883
I0430 18:21:06.810003 2633 node_manager.cc:1965] Resubmitting task 0000000096da81b9676fa2c496fbe523d6010d44 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 18:21:20.088655 2633 node_manager.cc:425] Actor 36ac9240fc5dcf81bd518d29c35cdaf9b6e66072 is disconnected, because its node 530288c8f0d562a41d974427ff93be1c9dfc0cb2 is removed from cluster. It may be reconstructed.
I0430 18:21:37.492851 2633 node_manager.cc:1965] Resubmitting task 00000000be049b79a56562870f9e6c7247f67b8e on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 18:22:19.843966 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 85e2bb65fc355b4e1020bb60d7dd33133e12d1ae at 10.142.0.132:40025
I0430 18:24:39.307135 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 5474caa574b620a986f440c37ffed417481e420b at 10.142.0.133:37699
I0430 18:26:58.160915 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client d4856fcc9424d9f7a9deb3d776dbc94b3a418612 at 10.142.0.134:33911
I0430 18:28:13.089535 2633 node_manager.cc:425] Actor 1e5f33ed49590e7c1e392de749144298f2e258a3 is disconnected, because its node 93bbbb9e3b45159073fc0ebba0a67cf176a7dd2e is removed from cluster. It may be reconstructed.
I0430 18:29:05.543244 2633 node_manager.cc:1965] Resubmitting task 00000000dba1e06b516c7eba374068d385de7594 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 18:29:30.662732 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client 670dcf668b0b5bcb69dfbe2816324e7745353d8c at 10.142.0.135:34577
I0430 18:31:47.932663 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client c11443dfda1ce22f76096ef19242854cd2367af9 at 10.142.0.136:35499
I0430 18:34:01.876145 2633 node_manager.cc:425] Actor 63fb21adb7768d0ad709f259a0dc675832d303b1 is disconnected, because its node afe63942c07d18abfaf9722d316ced4a2ec78737 is removed from cluster. It may be reconstructed.
I0430 18:34:08.127756 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client a15d40c6b4d0afdd0c564ec69aac23a2e4b7a902 at 10.142.0.137:36515
I0430 18:34:29.138823 2633 node_manager.cc:425] Actor 73757030c6cfbb1b16793e687198e3b8e57788a1 is disconnected, because its node c11443dfda1ce22f76096ef19242854cd2367af9 is removed from cluster. It may be reconstructed.
I0430 18:34:48.349531 2633 node_manager.cc:1965] Resubmitting task 0000000070584c42f7dfc488d6088b75694c8590 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 18:34:48.349756 2633 node_manager.cc:1965] Resubmitting task 0000000013c59865b4ae22b888b9d65d8e9e47f7 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 18:35:35.773393 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client e73a526afcbac36852531a3b9d32748255a9ed7f at 10.142.0.137:33951
I0430 18:38:12.769562 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client bc4e90e4fefd5b3f4a0adde8f7c178203f7b184c at 10.142.0.138:37243
I0430 18:40:26.095615 2633 node_manager.cc:366] [ConnectClient] Trying to connect to client a3c1f46e347a16f17cc833e4508cf3b246c84a82 at 10.142.0.139:41105
I0430 19:05:45.209869 2633 node_manager.cc:425] Actor 56ab128da63644e8a0ea8056ff1a05c88929f121 is disconnected, because its node e8408fcf1d37e4c149c893967d6bf406e231f044 is removed from cluster. It may be reconstructed.
I0430 19:06:21.288885 2633 node_manager.cc:1965] Resubmitting task 0000000086cc01a7bb77b378770af9e9e71c5282 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 19:14:18.724951 2633 node_manager.cc:425] Actor abe3b73880a2a1be1f586dd5ab4f6effc045a38c is disconnected, because its node e98379d99a4fcd2d6c82cba1943c2bd9d88a1f49 is removed from cluster. It may be reconstructed.
I0430 19:15:32.185073 2633 node_manager.cc:1965] Resubmitting task 000000001fac8e7ac9983d17476ef00b23bcf692 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 21:00:36.879307 2633 node_manager.cc:425] Actor a4e0ef1751cfb84f15afafb603c73727f3722a47 is disconnected, because its node 080c951cd89bd277331d9852b9c4fec70507a52b is removed from cluster. It may be reconstructed.
I0430 21:01:26.253690 2633 node_manager.cc:1965] Resubmitting task 000000005f4cbc8b8241d270ca04ac93a7c81cc5 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 21:44:41.322592 2633 node_manager.cc:425] Actor afb5ca0a737805e19f2d75c598a6a235248c7613 is disconnected, because its node b7af46513089fc06659391ab4d4b63e8c0805561 is removed from cluster. It may be reconstructed.
I0430 21:45:33.839720 2633 node_manager.cc:1965] Resubmitting task 00000000c727d4501540c6599e2772ecb3adb47d on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
I0430 22:00:12.921603 2633 node_manager.cc:425] Actor 0d8ed94a719a0e2042910ea4aa30275f06b62548 is disconnected, because its node ef72beabb673fe5d2c9b41c213214c16877ff4eb is removed from cluster. It may be reconstructed.
I0430 22:00:12.962640 2633 node_manager.cc:444] [HeartbeatAdded]: received heartbeat from unknown client id ef72beabb673fe5d2c9b41c213214c16877ff4eb
I0430 22:01:43.276129 2633 node_manager.cc:1965] Resubmitting task 000000005572a40bd888e34a5ebdcb09797f2c20 on client d6ebf6b5c2b58e8e57a1561c3a9e96b285f89d3a
F0430 23:56:55.326088 2633 node_manager.cc:801] Failed to update state for actor 9fd7dbab6cb99b865aed0dd77056595d022cd0fb
*** Check failure stack trace: ***
*** Aborted at 1556668615 (unix time) try "date -d @1556668615" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3e800000a49) received by PID 2633 (TID 0x7f6865c15740) from PID 2633; stack trace: ***
@ 0x7f68657f7390 (unknown)
@ 0x7f68649a8428 gsignal
@ 0x7f68649aa02a abort
@ 0x4d6699 google::logging_fail()
@ 0x4d846a google::LogMessage::Fail()
@ 0x4d9773 google::LogMessage::SendToLog()
@ 0x4d8192 google::LogMessage::Flush()
@ 0x4d8381 google::LogMessage::~LogMessage()
@ 0x4d5982 ray::RayLog::~RayLog()
@ 0x44f985 _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_7ActorIDERK15ActorTableDataTEZNS0_6raylet11NodeManager23HandleDisconnectedActorES6_bbEUlS3_S6_S9_E_E9_M_invokeERKSt9_Any_dataS3_S6_S9_
@ 0x49f32f _ZNSt17_Function_handlerIFbRKSsEZN3ray3gcs3LogINS3_7ActorIDE14ActorTableDataE8AppendAtERKNS3_5JobIDERKS6_RSt10shared_ptrI15ActorTableDataTERKSt8functionIFvPNS4_14AsyncGcsClientESD_RKSF_EESQ_iEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
@ 0x4bef60 (anonymous namespace)::ProcessCallback()
@ 0x4bf6e7 ray::gcs::GlobalRedisCallback()
@ 0x4c348b redisProcessCallbacks
@ 0x4c22d6 RedisAsioClient::handle_read()
@ 0x4c1798 boost::asio::detail::reactive_null_buffers_op<>::do_complete()
@ 0x41995d boost::asio::detail::scheduler::run()
@ 0x40d003 main
@ 0x7f6864993830 __libc_start_main
@ 0x4145d0 (unknown)
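One way to check whether the actor named in the fatal `Failed to update state for actor` line had already been reported as disconnected is to scan raylet.err for both kinds of lines. The following is only a rough sketch: the function name `summarize_raylet_log` and the regular expressions are assumptions written against the line shapes shown above, not part of Ray.

```python
import re

# Patterns written against the glog-style lines shown in raylet.err above.
DISCONNECT = re.compile(
    r"^I(\d{4} [\d:.]+) .*Actor (\w+) is disconnected, because its node (\w+) is removed"
)
FATAL = re.compile(r"^F(\d{4} [\d:.]+) .*Failed to update state for actor (\w+)")


def summarize_raylet_log(lines):
    """Collect actor disconnects and fatal actor-state failures from log lines."""
    disconnects = {}  # actor id -> (timestamp, node id)
    fatals = []       # (timestamp, actor id, earlier disconnect or None)
    for line in lines:
        m = DISCONNECT.match(line)
        if m:
            disconnects[m.group(2)] = (m.group(1), m.group(3))
            continue
        m = FATAL.match(line)
        if m:
            fatals.append((m.group(1), m.group(2), disconnects.get(m.group(2))))
    return disconnects, fatals


# Two lines copied from the log above, used as sample input.
sample = [
    "I0430 22:00:12.921603 2633 node_manager.cc:425] Actor "
    "0d8ed94a719a0e2042910ea4aa30275f06b62548 is disconnected, because its node "
    "ef72beabb673fe5d2c9b41c213214c16877ff4eb is removed from cluster. It may be reconstructed.",
    "F0430 23:56:55.326088 2633 node_manager.cc:801] Failed to update state for actor "
    "9fd7dbab6cb99b865aed0dd77056595d022cd0fb",
]
disconnects, fatals = summarize_raylet_log(sample)
```

In practice the function could be fed `open("raylet.err")` directly. A fatal entry whose third field is `None` means no earlier disconnect line mentioned that actor in the scanned log.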
reporter.err:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1508, in wrapper
return fun(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1584, in name
name = self._parse_stat_file()[0]
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
return fun(self)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1548, in _parse_stat_file
data = f.read()
ProcessLookupError: [Errno 3] No such process
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 176, in run
self.perform_iteration()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 165, in perform_iteration
stats = self.get_all_stats()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 156, in get_all_stats
"workers": self.get_workers(),
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 131, in get_workers
]) for x in psutil.process_iter() if running_worker(x.name())
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 131, in <listcomp>
]) for x in psutil.process_iter() if running_worker(x.name())
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/__init__.py", line 603, in name
name = self._proc.name()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1515, in wrapper
raise NoSuchProcess(self.pid, self._name)
psutil._exceptions.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=4020)
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1508, in wrapper
return fun(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1584, in name
name = self._parse_stat_file()[0]
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
return fun(self)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1547, in _parse_stat_file
with open_binary("%s/%s/stat" % (self._procfs_path, self.pid)) as f:
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_common.py", line 582, in open_binary
return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/19117/stat'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 176, in run
self.perform_iteration()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 165, in perform_iteration
stats = self.get_all_stats()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 156, in get_all_stats
"workers": self.get_workers(),
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 131, in get_workers
]) for x in psutil.process_iter() if running_worker(x.name())
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 131, in <listcomp>
]) for x in psutil.process_iter() if running_worker(x.name())
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/__init__.py", line 603, in name
name = self._proc.name()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1519, in wrapper
raise NoSuchProcess(self.pid, self._name)
psutil._exceptions.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=19117, name='python2')
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1508, in wrapper
return fun(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1584, in name
name = self._parse_stat_file()[0]
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
return fun(self)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1547, in _parse_stat_file
with open_binary("%s/%s/stat" % (self._procfs_path, self.pid)) as f:
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_common.py", line 582, in open_binary
return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/21185/stat'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 176, in run
self.perform_iteration()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 165, in perform_iteration
stats = self.get_all_stats()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 156, in get_all_stats
"workers": self.get_workers(),
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 131, in get_workers
]) for x in psutil.process_iter() if running_worker(x.name())
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 131, in <listcomp>
]) for x in psutil.process_iter() if running_worker(x.name())
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/__init__.py", line 603, in name
name = self._proc.name()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1519, in wrapper
raise NoSuchProcess(self.pid, self._name)
psutil._exceptions.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=21185)
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1508, in wrapper
return fun(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1584, in name
name = self._parse_stat_file()[0]
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
return fun(self)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1547, in _parse_stat_file
with open_binary("%s/%s/stat" % (self._procfs_path, self.pid)) as f:
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_common.py", line 582, in open_binary
return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/27101/stat'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 176, in run
self.perform_iteration()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 165, in perform_iteration
stats = self.get_all_stats()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 156, in get_all_stats
"workers": self.get_workers(),
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 131, in get_workers
]) for x in psutil.process_iter() if running_worker(x.name())
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/reporter.py", line 131, in <listcomp>
]) for x in psutil.process_iter() if running_worker(x.name())
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/__init__.py", line 603, in name
name = self._proc.name()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/psutil/_pslinux.py", line 1519, in wrapper
raise NoSuchProcess(self.pid, self._name)
psutil._exceptions.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=27101)
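The reporter.err tracebacks all have the same shape: a process exits after `psutil.process_iter()` has listed it but before `x.name()` is called, so the name lookup raises `NoSuchProcess` and the whole iteration fails. A minimal sketch of tolerating that race is below. It is not Ray's actual fix. `collect_worker_names` and `FakeProcess` are hypothetical names, and the demo uses stand-in objects so that it runs without psutil installed.

```python
def collect_worker_names(processes, is_worker, vanished_errors):
    """Return names of processes accepted by `is_worker`.

    A process can exit between being listed and being queried; any error in
    `vanished_errors` raised by `name()` skips that process instead of
    aborting the scan.
    """
    names = []
    for proc in processes:
        try:
            name = proc.name()
        except vanished_errors:
            continue  # process disappeared mid-scan; ignore it
        if is_worker(name):
            names.append(name)
    return names


class FakeProcess:
    """Hypothetical stand-in for a process object, used only for the demo."""

    def __init__(self, name, alive=True):
        self._name = name
        self._alive = alive

    def name(self):
        if not self._alive:
            raise ProcessLookupError("No such process")
        return self._name


demo = [
    FakeProcess("ray_worker"),
    FakeProcess("python2", alive=False),  # simulates a process that has exited
    FakeProcess("bash"),
]
found = collect_worker_names(
    demo, lambda n: n.startswith("ray_"), (ProcessLookupError,)
)
```

With psutil itself, the equivalent call would presumably be `collect_worker_names(psutil.process_iter(), running_worker, (psutil.NoSuchProcess,))`, where `running_worker` is the predicate visible in the traceback.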
worker-*.err (similar error in several worker files):
Ray worker pid: 2649
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/workers/default_worker.py", line 98, in <module>
ray.worker.global_worker.main_loop()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/worker.py", line 1038, in main_loop
task = self._get_next_task_from_raylet()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/worker.py", line 1021, in _get_next_task_from_raylet
task = self.raylet_client.get_task()
File "python/ray/_raylet.pyx", line 244, in ray._raylet.RayletClient.get_task
File "python/ray/_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Raylet connection closed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/workers/default_worker.py", line 105, in <module>
driver_id=None)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/utils.py", line 68, in push_error_to_driver
time.time())
File "python/ray/_raylet.pyx", line 297, in ray._raylet.RayletClient.push_error
File "python/ray/_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Connection closed unexpectedly.
Issue Analytics
- State:
- Created 4 years ago
- Comments:8 (7 by maintainers)
Top GitHub Comments
Closing as I haven’t heard any new reports for this.
Hi, I’m a bot from the Ray team 😃
To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray’s public slack channel.