wandb: Network error (SSLError), entering retry loop.
Issue description
The message "wandb: Network error (SSLError), entering retry loop." appears repeatedly and interferes with training.
Current behavior
The training still runs and I can see the metrics in the wandb dashboard, and the error eventually clears ("wandb: Network error resolved after 0:06:24.504729, resuming normal operation"). However, I think it really slows down the training, as this occurs very frequently. The wandb debug.log contains:
Caused by SSLError(SSLError(1, '[SSL: KRB5_S_TKT_NYV] unexpected eof while reading (_ssl.c:1091)
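To check whether the retry loop really costs a meaningful amount of time, one option is to add up the outage durations that wandb reports. The following is a minimal sketch, assuming the console output or log contains lines in the same form as the "Network error resolved after 0:06:24.504729" message quoted above; the function name total_outage_seconds is hypothetical and not part of wandb or ivadomed.

```python
import re

# Matches the duration in messages such as:
#   wandb: Network error resolved after 0:06:24.504729, resuming normal operation
# The H:MM:SS[.ffffff] layout is assumed from the single message quoted in this issue.
RESOLVED_RE = re.compile(
    r"Network error resolved after (\d+):(\d{2}):(\d{2}(?:\.\d+)?)"
)


def total_outage_seconds(lines):
    """Sum the outage durations (in seconds) reported in an iterable of log lines."""
    total = 0.0
    for line in lines:
        match = RESOLVED_RE.search(line)
        if match:
            hours, minutes, seconds = match.groups()
            total += int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return total


if __name__ == "__main__":
    sample = [
        "wandb: Network error (SSLError), entering retry loop.",
        "wandb: Network error resolved after 0:06:24.504729, resuming normal operation",
    ]
    print(f"Total time spent in retry loops: {total_outage_seconds(sample):.1f} s")
```

Comparing this total with the wall-clock training time gives a rough upper bound on the slowdown, since the retries may overlap with training rather than block it.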
wandb support said (April 2022):
happens as a result of either (1) improper installation of SSL on your Python distro, as noted by some SO users here. I would recommend reinstalling Anaconda/your virtual environment and upgrading openssl.
(But I don’t think I have permission to do so on the NeuroPoly servers.)
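Before reinstalling anything, it may help to see which OpenSSL build the Python interpreter in the training environment is actually linked against. The sketch below uses only the standard library ssl module and needs no special permissions; the function name ssl_report is hypothetical, and the output only describes the environment, it does not by itself confirm the cause of the SSLError.

```python
import ssl
import sys


def ssl_report():
    """Collect basic facts about the SSL setup of the running interpreter."""
    paths = ssl.get_default_verify_paths()
    return {
        "python": sys.version.split()[0],
        # Version string of the OpenSSL library this Python was built against.
        "openssl_version": ssl.OPENSSL_VERSION,
        # Default CA bundle locations; either may be None on some systems.
        "cafile": paths.cafile or paths.openssl_cafile,
        "capath": paths.capath or paths.openssl_capath,
    }


if __name__ == "__main__":
    for key, value in ssl_report().items():
        print(f"{key}: {value}")
```

Running this inside the same virtual environment used for ivadomed shows whether the environment's Python uses the system OpenSSL or a separately bundled one, which is useful information to pass on to wandb support or a server administrator.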
Expected behavior
Training runs without interruption.
Steps to reproduce
Running normal training with the bavaria-quebec preprocessed data:
ivadomed --train -c config_Mod3DUnet_ax.json --path-data ../data/ --path-output ../results/
config file
{
    "command": "train",
    "gpu_ids": [0],
    "path_output": "../results/ax_output_run1",
    "model_name": "ModifiedUnet3d_singleContrast",
    "debugging": true,
    "object_detection_params": {
        "object_detection_path": null,
        "safety_factor": [1.0, 1.0, 1.0]
    },
    "wandb": {
        "wandb_api_key": "",
        "project_name": "bavaria",
        "group_name": "lesion_ax",
        "run_name": "ax_run1",
        "log_grads_every": 100
    },
    "loader_parameters": {
        "path_data": ["~/duke/temp/kiri/bavaria-preprocessed"],
        "subject_selection:": {"n": [], "metadata": [], "value": []},
        "target_suffix": ["_lesion-manual"],
        "extensions": [".nii.gz"],
        "roi_params": {
            "suffix": null,
            "slice_filter_roi": null
        },
        "contrast_params": {
            "training_validation": ["T2w"],
            "testing": ["T2w"],
            "balance": {}
        },
        "slice_filter_params": {
            "filter_empty_mask": false,
            "filter_empty_input": false
        },
        "slice_axis": "axial",
        "multichannel": false,
        "soft_gt": false
    },
    "split_dataset": {
        "fname_split": null,
        "random_seed": 42,
        "split_method" : "participant_id",
        "data_testing": {"data_type": null, "data_value":[]},
        "balance": null,
        "train_fraction": 0.6,
        "test_fraction": 0.2
    },
    "training_parameters": {
        "batch_size": 2,
        "loss": {
            "name": "DiceLoss"
        },
        "training_time": {
            "num_epochs": 100,
            "early_stopping_patience": 100,
            "early_stopping_epsilon": 0.001
        },
        "scheduler": {
            "initial_lr": 1e-3,
            "lr_scheduler": {
                "name": "CosineAnnealingLR",
                "base_lr": 1e-5,
                "max_lr": 1e-3
            }
        },
        "balance_samples": {"applied": false, "type": "gt"}
    },
    "default_model": {
        "name": "Unet",
        "dropout_rate": 0.3,
        "bn_momentum": 0.1,
        "final_activation": "sigmoid",
        "is_2d": false,
        "depth": 4
    },
    "Modified3DUNet": {
        "applied": true,
        "length_3D": [160, 160, 720],
        "stride_3D": [80, 80, 360],
        "attention": false,
        "n_filters": 3
    },
    "uncertainty": {
        "epistemic": false,
        "aleatoric": false,
        "n_it": 0
    },
    "postprocessing": {
        "binarize_prediction": {"thr": 0.5},
        "uncertainty": {"thr": -1, "suffix": "_unc-vox.nii.gz"}
    },
    "evaluation_parameters": {},
    "transformation": {
        "Resample": {
            "wspace": 0.5,
            "hspace": 0.5,
            "dspace": 1
        },
        "CenterCrop": {
            "size": [160, 160, 720]
        },
        "RandomAffine": {
            "degrees": 10,
            "scale": [0.3, 0.3, 0.3],
            "translate": [0.1, 0.1, 0.1],
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
        "ElasticTransform": {
            "alpha_range": [25.0, 35.0],
            "sigma_range": [3.5, 4.5],
            "p": 0.5,
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
        "RandomReverse": {
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
        "RandomGamma": {
            "log_gamma_range": [-1.5, 1.5],
            "p": 0.5,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "RandomBiasField": {
            "coefficients": 0.5,
            "order": 3,
            "p": 0.3,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "RandomBlur": {
            "sigma_range": [0.0, 1.0],
            "p": 0.3,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "NumpyToTensor": {},
        "NormalizeInstance": {"applied_to": ["im"]}
    }
}
Environment
System description
NeuroPoly server, Rosenberg, Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-53-generic x86_64)
Installed packages
on branch mhb/1213-fix-3d-data-augmentation
from PR 1222
Output of pip freeze
absl-py==1.1.0
astor==0.8.1
astunparse==1.6.3
awscli==1.22.34
beniget==0.4.1
bids-validator==1.9.9
botocore==1.23.34
brz-etckeeper==0.0.0
cachetools==5.2.0
certifi==2020.6.20
chardet==4.0.0
click==8.0.3
colorama==0.4.4
coloredlogs==15.0.1
command-not-found==0.3
commonmark==0.9.1
cryptography==3.4.8
csv-diff==1.1
cycler==0.11.0
dbus-python==1.2.18
decorator==4.4.2
Deprecated==1.2.13
dictdiffer==0.9.0
dill==0.3.5.1
distlib==0.3.4
distro==1.7.0
distro-info===1.1build1
dnspython==2.1.0
docker-pycreds==0.4.0
docopt==0.6.2
docutils==0.17.1
filelock==3.6.0
flatbuffers==2.0.7
fonttools==4.33.3
formulaic==0.3.4
fsleyes==1.5.0
fsleyes-props==1.8.2
fsleyes-widgets==0.12.3
fslpy==3.9.5
gast==0.4.0
gitdb==4.0.10
GitPython==3.1.29
google-auth==2.8.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
gpg===1.16.0-unknown
grpcio==1.47.0
h5py==3.7.0
humanfriendly==10.0
humanize==4.4.0
idna==3.3
imageio==2.22.4
importlib-metadata==4.6.4
interface-meta==1.3.0
iotop==0.6
-e git+https://github.com/ivadomed/ivadomed.git@d6385f1c57b7433a57003167c215f2288db3b631#egg=ivadomed
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.2.0
keras==2.11.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.3
libclang==14.0.1
loguru==0.6.0
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.5.2
more-itertools==8.10.0
mpmath==1.2.1
netifaces==0.11.0
networkx==2.8.8
nibabel==3.2.2
num2words==0.5.12
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.0
onnxruntime==1.13.1
opt-einsum==3.3.0
osfclient==0.0.5
packaging==21.3
pandas==1.4.4
pathtools==0.1.2
Pillow==9.0.1
platformdirs==2.5.1
ply==3.11
promise==2.3
protobuf==3.19.4
psutil==5.9.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybids==0.15.5
Pygments==2.11.2
PyGObject==3.42.1
PyOpenGL==3.1.6
pyparsing==2.4.7
python-apt==2.3.0+ubuntu2.1
python-dateutil==2.8.1
pythran==0.10.0
pytz==2022.6
PyWavelets==1.4.1
PyYAML==5.4.1
requests==2.25.1
requests-oauthlib==1.3.1
requests-toolbelt==0.9.1
rich==12.6.0
roman==3.3
rsa==4.8
s3transfer==0.5.0
scikit-image==0.19.3
scikit-learn==1.2.0
scipy==1.8.0
screen-resolution-extra==0.0.0
seaborn==0.12.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shellingham==1.5.0
shortuuid==1.0.11
SimpleITK==2.2.1
six==1.16.0
smmap==5.0.0
SQLAlchemy==1.3.24
ssh-import-id==5.11
sympy==1.11.1
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.11.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
threadpoolctl==3.1.0
tifffile==2022.10.10
torch==1.11.0
torchaudio==0.13.0
torchio==0.18.86
torchvision==0.12.0
tqdm==4.64.0
typer==0.7.0
typing_extensions==4.2.0
ubuntu-drivers-common==0.0.0
ufw==0.36.1
unattended-upgrades==0.1
urllib3==1.26.13
virtualenv==20.13.0+ds
wandb==0.13.7
Werkzeug==2.1.2
wrapt==1.14.1
wxPython==4.0.7
xkit==0.0.0
zipp==1.0.0
Still, if you want the live mode (which is useful), we need to figure out what is wrong in your config. I don’t think it’s a network issue because I’m using the same computer and I don’t experience this issue.
Thanks for the suggestion @kanishk16. I am no longer getting the error message after setting 'mode' = 'offline'.
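For reference, the wandb client also reads its mode from the WANDB_MODE environment variable, so offline mode can be requested without touching the training code. The following is a minimal sketch of that approach; the helper name force_wandb_offline is hypothetical, and how ivadomed itself forwards a 'mode' setting to wandb is not shown in this thread, so treat this as an alternative route rather than the exact change made above.

```python
import os


def force_wandb_offline(env=os.environ):
    """Request wandb offline mode by setting WANDB_MODE in the given environment mapping.

    In offline mode wandb writes run data to the local run directory instead of
    uploading it live, so no network retry loop is entered during training.
    """
    env["WANDB_MODE"] = "offline"
    return env


if __name__ == "__main__":
    # Call this before wandb is initialised (e.g. at the top of a launcher script).
    force_wandb_offline()
    print("WANDB_MODE =", os.environ["WANDB_MODE"])
```

Runs recorded offline can be uploaded afterwards with the "wandb sync" command pointed at the local run directory, at the cost of losing the live dashboard during training.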