BUG: Training gets killed due to Neptune
Describe the bug
Training was running in PyTorch Lightning when it suddenly crashed, with tracebacks pointing to an error raised from Neptune.
Reproduction
Couldn't reproduce.
Expected behavior
Training should continue without crashing the experiment, whatever the issue with the logger is.
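To make the expected behavior concrete, here is a minimal sketch of a guard around a logger-like object, so that a failing tracking call is downgraded to a warning and the training loop keeps going. `SafeLogger`, `FlakyTracker`, and `max_failures` are hypothetical names used only for illustration; none of this is Neptune's or PyTorch Lightning's API, and it only covers exceptions raised in the calling thread, not those raised inside a tracker's own background threads.

```python
import logging
from typing import Any

log = logging.getLogger("safe_logging")


class SafeLogger:
    """Hypothetical wrapper: forwards calls to `inner`, but turns any
    exception into a warning and stops forwarding after `max_failures`."""

    def __init__(self, inner: Any, max_failures: int = 3) -> None:
        self._inner = inner
        self._max_failures = max_failures
        self._failures = 0

    @property
    def disabled(self) -> bool:
        # True once the wrapped object has failed too many times.
        return self._failures >= self._max_failures

    def __getattr__(self, name: str) -> Any:
        # Only reached for attributes not defined on SafeLogger itself.
        if name.startswith("_"):
            raise AttributeError(name)
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr

        def guarded(*args: Any, **kwargs: Any) -> Any:
            if self.disabled:
                return None  # skip the call, keep training
            try:
                return attr(*args, **kwargs)
            except Exception as exc:
                self._failures += 1
                log.warning("tracking call %s failed (%s); continuing", name, exc)
                return None

        return guarded


class FlakyTracker:
    """Stand-in for a remote tracker whose backend returns errors."""

    def __init__(self) -> None:
        self.calls = 0

    def log_metrics(self, metrics: dict, step: int) -> None:
        self.calls += 1
        raise ConnectionError("404 Not Found")


flaky = FlakyTracker()
tracker = SafeLogger(flaky, max_failures=2)

completed_steps = 0
for step in range(5):
    tracker.log_metrics({"loss": 0.1}, step=step)  # never raises
    completed_steps += 1
```

In this sketch the loop finishes all five steps, and the wrapped tracker is no longer called after its second failure.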
Traceback
Epoch 21 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26600/-- 1:09:03 • -:--:-- 17.04it/s loss: 0.1 v_num: -200 val_loss: 0.09
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1600/-- 0:02:47 • -:--:-- 17.04it/s loss: 0.1 v_num: -200 val_loss: 0.09
Epoch 21, global step 549999: val_track_loss reached 0.08127 (best 0.08127), saving model to "/home/kp/experiment_logs/vad/ru
Epoch 22 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26600/-- 1:08:53 • -:--:-- 17.31it/s loss: 0.101 v_num: -200 val_loss: 0.09
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1600/-- 0:02:37 • -:--:-- 17.31it/s loss: 0.101 v_num: -200 val_loss: 0.09
Epoch 22, global step 574999: val_track_loss reached 0.08107 (best 0.08107), saving model to "/home/kp/experiment_logs/vad/ru
Epoch 23 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26600/-- 1:07:59 • -:--:-- 17.34it/s loss: 0.104 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1600/-- 0:02:27 • -:--:-- 17.34it/s loss: 0.104 v_num: -200 val_loss: 0.089
/home/kp/Remote/zspeech/zspeech/utils/training_utils.py:308: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
figure = plt.figure(figsize=(8, 8))
/home/kp/Remote/zspeech/zspeech/utils/training_utils.py:308: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
figure = plt.figure(figsize=(8, 8))
/home/kp/Remote/zspeech/zspeech/utils/training_utils.py:308: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Epoch 23 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26600/-- 1:07:59 • -:--:-- 17.34it/s loss: 0.104 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1600/-- 0:02:27 • -:--:-- 17.34it/s loss: 0.104 v_num: -200 val_loss: 0.089
Epoch 23, global step 599999: val_track_loss reached 0.08095 (best 0.08095), saving model to "/home/kp/experiment_logs/vad/ru
Epoch 24 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26600/-- 1:08:03 • -:--:-- 17.11it/s loss: 0.0989 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1600/-- 0:02:26 • -:--:-- 17.11it/s loss: 0.0989 v_num: -200 val_loss:
Epoch 25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26448/-- 1:08:08 • -:--:-- 17.19it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1448/-- 0:02:16 • -:--:-- 17.19it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Unexpected error occurred in Neptune background thread: Killing Neptune asynchronous thread. All data is safe on disk and can be later synced manually using `neptune sync` command.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 473, in _execute_operations
result = self.leaderboard_client.api.executeOperations(**kwargs).response().result
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 200, in response
swagger_result = self._get_swagger_result(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 353, in unmarshal_response
raise_on_expected(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 420, in raise_on_expected
raise make_http_exception(
bravado.exception.HTTPNotFound: 404 Not Found
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
self.work()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 177, in work
self.process_batch(batch, version)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
result = func(self_, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 187, in process_batch
result = self._processor._backend.execute_operations(self._processor._run_id, batch)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 363, in execute_operations
errors.extend(self._execute_operations(run_id, other_operations))
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.p
Unexpected error occurred in Neptune background thread: Killing Neptune asynchronous thread. All data is safe on disk and can be later synced manually using `neptune sync` command.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 473, in _execute_operations
result = self.leaderboard_client.api.executeOperations(**kwargs).response().result
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 200, in response
swagger_result = self._get_swagger_result(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 353, in unmarshal_response
raise_on_expected(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 420, in raise_on_expected
raise make_http_exception(
bravado.exception.HTTPNotFound: 404 Not Found
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
self.work()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 177, in work
self.process_batch(batch, version)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
result = func(self_, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 187, in process_batch
result = self._processor._backend.execute_operations(self._processor._run_id, batch)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 363, in execute_operations
errors.extend(self._execute_operations(run_id, other_operations))
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 476, in _execute_operations
raise RunUUIDNotFound(run_id=run_id) from e
neptune.new.exceptions.RunUUIDNotFound: Run with ID 62136ba6-d853-4d76-a2df-e6321599479c not found. Could be deleted.
Epoch 25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26510/-- 1:08:11 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1510/-- 0:02:19 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Unexpected error occurred in Neptune background thread: Killing Neptune asynchronous thread. All data is safe on disk and can be later synced manually using `neptune sync` command.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 473, in _execute_operations
result = self.leaderboard_client.api.executeOperations(**kwargs).response().result
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 200, in response
swagger_result = self._get_swagger_result(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 353, in unmarshal_response
raise_on_expected(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 420, in raise_on_expected
raise make_http_exception(
bravado.exception.HTTPNotFound: 404 Not Found
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
self.work()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 177, in work
self.process_batch(batch, version)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
result = func(self_, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 187, in process_batch
result = self._processor._backend.execute_operations(self._processor._run_id, batch)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 363, in execute_operations
errors.extend(self._execute_operations(run_id, other_operations))
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.p
Epoch 25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26514/-- 1:08:12 • -:--:-- 17.26it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1514/-- 0:02:19 • -:--:-- 17.26it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Unexpected error occurred in Neptune background thread: Killing Neptune asynchronous thread. All data is safe on disk and can be later synced manually using `neptune sync` command.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 473, in _execute_operations
result = self.leaderboard_client.api.executeOperations(**kwargs).response().result
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 200, in response
swagger_result = self._get_swagger_result(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 353, in unmarshal_response
raise_on_expected(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 420, in raise_on_expected
raise make_http_exception(
bravado.exception.HTTPNotFound: 404 Not Found
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
self.work()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 177, in work
self.process_batch(batch, version)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
result = func(self_, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 187, in process_batch
result = self._processor._backend.execute_operations(self._processor._run_id, batch)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 363, in execute_operations
errors.extend(self._execute_operations(run_id, other_operations))
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.p
Epoch 25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26522/-- 1:08:12 • -:--:-- 17.26it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1522/-- 0:02:20 • -:--:-- 17.26it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Unexpected error occurred in Neptune background thread: Killing Neptune ping thread. Your run's status will not be updated and the run will be shown as inactive.
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 317, in ping_run
self.leaderboard_client.api.ping(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 200, in response
swagger_result = self._get_swagger_result(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 353, in unmarshal_response
raise_on_expected(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 420, in raise_on_expected
raise make_http_exception(
bravado.exception.HTTPNotFound: 404 Not Found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
self.work()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
result = func(self_, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/utils/ping_background_job.py", line 68, in work
self._run.ping()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/run.py", line 243, in ping
self._backend.ping_run(self._id)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.p
Unexpected error occurred in Neptune background thread: Killing Neptune ping thread. Your run's status will not be updated and the run will be shown as inactive.
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 317, in ping_run
self.leaderboard_client.api.ping(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 200, in response
swagger_result = self._get_swagger_result(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 353, in unmarshal_response
raise_on_expected(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 420, in raise_on_expected
Epoch 25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26559/-- 1:08:14 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1559/-- 0:02:22 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Unexpected error occurred in Neptune background thread: Killing Neptune ping thread. Your run's status will not be updated and the run will be shown as inactive.
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.p
raise make_http_exception(
Epoch 25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26559/-- 1:08:14 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1559/-- 0:02:22 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
swagger_result = self._get_swagger_result(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
bravado.exception.HTTPNotFound: 404 Not Found
Epoch 25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26559/-- 1:08:14 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1559/-- 0:02:22 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 420, in raise_on_expected
raise make_http_exception(
bravado.exception.HTTPNotFound: 404 Not Found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/threading.py", line 932, in _bootstrap_inner
During handling of the above exception, another exception occurred:
Epoch 25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26559/-- 1:08:14 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1559/-- 0:02:22 • -:--:-- 17.24it/s loss: 0.0979 v_num: -200 val_loss: 0.089
self.work()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
result = func(self_, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/utils/ping_background_job.py", line 68, in work
self._run.ping()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/run.py", line 243, in ping
self._backend.ping_run(self._id)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.p
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
self.work()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
result = func(self_, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/utils/ping_background_job.py", line 68, in work
self._run.ping()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/run.py", line 243, in ping
self._backend.ping_run(self._id)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 322, in ping_run
raise RunUUIDNotFound(run_id)
neptune.new.exceptions.RunUUIDNotFound: Run with ID 62136ba6-d853-4d76-a2df-e6321599479c not found. Could be deleted.
Epoch 25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26561/-- 1:08:14 • -:--:-- 17.22it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1561/-- 0:02:22 • -:--:-- 17.22it/s loss: 0.0979 v_num: -200 val_loss: 0.089
Unexpected error occurred in Neptune background thread: Killing Neptune ping thread. Your run's status will not be updated and the run will be shown as inactive.
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.py", line 317, in ping_run
self.leaderboard_client.api.ping(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 200, in response
swagger_result = self._get_swagger_result(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 124, in wrapper
return func(self, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 300, in _get_swagger_result
unmarshal_response(
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 353, in unmarshal_response
raise_on_expected(incoming_response)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/bravado/http_future.py", line 420, in raise_on_expected
raise make_http_exception(
bravado.exception.HTTPNotFound: 404 Not Found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 54, in run
self.work()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 78, in wrapper
result = func(self_, *args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/utils/ping_background_job.py", line 68, in work
self._run.ping()
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/run.py", line 243, in ping
self._backend.ping_run(self._id)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/utils.py", line 71, in wrapper
return func(*args, **kwargs)
File "/home/kp/miniconda3/envs/gamd6-kp4/lib/python3.8/site-packages/neptune/new/internal/backends/hosted_neptune_backend.p
Shutting down background jobs, please wait a moment...
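A note for reading the traceback above: it contains two different chaining messages. "The above exception was the direct cause of the following exception" is what Python prints for `raise ... from e` (as on line 476 of `hosted_neptune_backend.py`), while "During handling of the above exception, another exception occurred" is printed when a new exception is raised inside an `except` block without `from` (as on line 322). The sketch below reproduces both with the standard library only; `RunNotFound` and the function names are hypothetical stand-ins, not Neptune code.

```python
import traceback


class RunNotFound(Exception):
    """Hypothetical stand-in for a tracker-specific 'run not found' error."""


def explicit_chain() -> None:
    try:
        raise LookupError("404 Not Found")
    except LookupError as e:
        # `from e` records the original error as __cause__.
        raise RunNotFound("run not found") from e


def implicit_chain() -> None:
    try:
        raise LookupError("404 Not Found")
    except LookupError:
        # No `from`: the original error is kept only as __context__.
        raise RunNotFound("run not found")


def render(func) -> str:
    """Run func and return the formatted traceback of the RunNotFound it raises."""
    try:
        func()
    except RunNotFound as exc:
        return "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return ""


explicit_text = render(explicit_chain)
implicit_text = render(implicit_chain)

print(explicit_text)
print(implicit_text)
```

Both forms report the same underlying `404 Not Found`; the wording only tells you whether the library re-raised deliberately or while handling the first error.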
Environment
PyTorch version: 1.10.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.4
Libc version: glibc-2.31
Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-38-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: RTX A6000
GPU 1: RTX A6000
GPU 2: RTX A6000
GPU 3: RTX A6000
GPU 4: RTX A6000
GPU 5: RTX A6000
GPU 6: RTX A6000
GPU 7: RTX A6000
Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.1
[pip3] pytorch-lightning==1.5.1
[pip3] torch==1.10.0+cu111
[pip3] torch-poly-lr-decay==0.0.1
[pip3] torchaudio==0.10.0+cu111
[pip3] torchmetrics==0.6.0
[conda] mypy 0.910 pypi_0 pypi
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] neptune-client 0.12.1 pypi_0 pypi
[conda] numpy 1.21.1 pypi_0 pypi
[conda] pytorch-lightning 1.5.1 pypi_0 pypi
[conda] torch 1.10.0+cu111 pypi_0 pypi
[conda] torch-poly-lr-decay 0.0.1 pypi_0 pypi
[conda] torchaudio 0.10.0+cu111 pypi_0 pypi
[conda] torchmetrics 0.6.0 pypi_0 pypi
Issue Analytics
- State:
- Created 2 years ago
- Comments:8 (6 by maintainers)
Top GitHub Comments
Thanks for sharing this information!
I will get back to you with updates and/or more questions, if you don't mind.
Hey @stonelazy
Thanks for your cooperation.
I spoke to the devs, and we can't really replicate it or pinpoint where the error happened, as it only happened once. I will keep an eye out for such errors during our maintenance breaks and gather more data about them if they repeat.
For now, I will close the issue.