
wandb: Network error (SSLError), entering retry loop.


Issue description

The message “wandb: Network error (SSLError), entering retry loop.” keeps appearing and interferes with training.

Current behavior

The training still runs and I can see the metrics in the wandb dashboard; the log eventually reports “wandb: Network error resolved after 0:06:24.504729, resuming normal operation.” However, I think it really slows down the training, as this occurs very frequently. The wandb debug.log contains: Caused by SSLError(SSLError(1, '[SSL: KRB5_S_TKT_NYV] unexpected eof while reading (_ssl.c:1091)

wandb support said (April 2022):

This happens as a result of either (1) improper installation of SSL on your Python distro, as noted by some SO users here. I would recommend reinstalling Anaconda/your virtual environment and upgrading openssl.

(But I don’t think I have permission to do so on the NeuroPoly servers.)
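
A check that needs no admin rights is to ask Python which OpenSSL build it is linked against, using the standard ssl module. The sketch below is only a diagnostic aid and does not fix anything; the helper name describe_ssl is made up for illustration.

```python
import ssl
import sys


def describe_ssl():
    """Return the interpreter path and the OpenSSL build Python's ssl module uses."""
    return {
        "python": sys.executable,
        "openssl_version": ssl.OPENSSL_VERSION,
        "openssl_version_info": ssl.OPENSSL_VERSION_INFO,
        "default_verify_paths": ssl.get_default_verify_paths(),
    }


if __name__ == "__main__":
    # Print one "name: value" line per entry so the output can be pasted into the issue.
    for key, value in describe_ssl().items():
        print(f"{key}: {value}")
```

Running it inside the same virtual environment used for training shows whether that environment picks up the system OpenSSL or a different build.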

Expected behavior

Training runs without interruption.

Steps to reproduce

Run a normal training: ivadomed --train -c config_Mod3DUnet_ax.json --path-data ../data/ --path-output ../results/ with the bavaria-quebec preprocessed data.

config file

{
    "command": "train",
    "gpu_ids": [0],
    "path_output": "../results/ax_output_run1",
    "model_name": "ModifiedUnet3d_singleContrast",
    "debugging": true,
    "object_detection_params": {
        "object_detection_path": null,
        "safety_factor": [1.0, 1.0, 1.0]
    },
    "wandb": {
        "wandb_api_key": "",
        "project_name": "bavaria",
        "group_name": "lesion_ax",
        "run_name": "ax_run1",
        "log_grads_every": 100
    },
    "loader_parameters": {
        "path_data": ["~/duke/temp/kiri/bavaria-preprocessed"],
        "subject_selection:": {"n": [], "metadata": [], "value": []},
        "target_suffix": ["_lesion-manual"],
        "extensions": [".nii.gz"],
        "roi_params": {
            "suffix": null,
            "slice_filter_roi": null
        },
        "contrast_params": {
            "training_validation": ["T2w"],
            "testing": ["T2w"],
            "balance": {}
        },
        "slice_filter_params": {
            "filter_empty_mask": false,
            "filter_empty_input": false
        },
        "slice_axis": "axial",
        "multichannel": false,
        "soft_gt": false
    },
    "split_dataset": {
        "fname_split": null,
        "random_seed": 42,
        "split_method" : "participant_id",
        "data_testing": {"data_type": null, "data_value":[]},
        "balance": null,
        "train_fraction": 0.6,
        "test_fraction": 0.2
    },
    "training_parameters": {
        "batch_size": 2,
        "loss": {
            "name": "DiceLoss"
        },
        "training_time": {
            "num_epochs": 100,
            "early_stopping_patience": 100,
            "early_stopping_epsilon": 0.001
        },
        "scheduler": {
            "initial_lr": 1e-3,
            "lr_scheduler": {
                "name": "CosineAnnealingLR",
                "base_lr": 1e-5,
                "max_lr": 1e-3
            }
        },
        "balance_samples": {"applied": false, "type": "gt"}
    },
    "default_model": {
        "name": "Unet",
        "dropout_rate": 0.3,
        "bn_momentum": 0.1,
        "final_activation": "sigmoid",
        "is_2d": false,
        "depth": 4
    },
    "Modified3DUNet": {
        "applied": true,
        "length_3D": [160, 160, 720],
        "stride_3D": [80, 80, 360],
        "attention": false,
        "n_filters": 3
    },
    "uncertainty": {
        "epistemic": false,
        "aleatoric": false,
        "n_it": 0
    },
    "postprocessing": {
        "binarize_prediction": {"thr": 0.5},
        "uncertainty": {"thr": -1, "suffix": "_unc-vox.nii.gz"}
    },
    "evaluation_parameters": {},
    "transformation": {
        "Resample": {
            "wspace": 0.5,
            "hspace": 0.5,
            "dspace": 1
        },
        "CenterCrop": {
            "size": [160, 160, 720]
        },
        "RandomAffine": {
            "degrees": 10,
            "scale": [0.3, 0.3, 0.3],
            "translate": [0.1, 0.1, 0.1],
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
        "ElasticTransform": {
            "alpha_range": [25.0, 35.0],
            "sigma_range": [3.5, 4.5],
            "p": 0.5,
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
        "RandomReverse": {
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
        "RandomGamma": {
            "log_gamma_range": [-1.5, 1.5],
            "p": 0.5,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "RandomBiasField": {
            "coefficients": 0.5,
            "order": 3,
            "p": 0.3,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "RandomBlur": {
            "sigma_range": [0.0, 1.0],
            "p": 0.3,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "NumpyToTensor": {},
        "NormalizeInstance": {"applied_to": ["im"]}
    }
}
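
One detail in the config above can be checked mechanically: the subject_selection key under loader_parameters is written with a trailing colon inside the quotes. Nothing in this thread links that to the SSLError, and whether ivadomed ignores the key is not confirmed here. The sketch below walks a parsed JSON config and lists keys that contain anything other than letters, digits and underscores; the function name find_odd_keys and that rule are assumptions for illustration, not part of ivadomed.

```python
import json
import re

# Keys made only of letters, digits and underscores are treated as "plain".
PLAIN_KEY = re.compile(r"^[A-Za-z0-9_]+$")


def find_odd_keys(node, path=""):
    """Recursively collect JSON object keys that contain unexpected characters."""
    odd = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}/{key}"
            if not PLAIN_KEY.match(key):
                odd.append(here)
            odd.extend(find_odd_keys(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            odd.extend(find_odd_keys(item, f"{path}[{index}]"))
    return odd


if __name__ == "__main__":
    # Excerpt of the loader_parameters block, copied from the config above.
    excerpt = json.loads(
        '{"loader_parameters": {"subject_selection:": {"n": [], "metadata": [], "value": []},'
        ' "target_suffix": ["_lesion-manual"]}}'
    )
    print(find_odd_keys(excerpt))
```

To check the whole file, replace the excerpt with json.load on config_Mod3DUnet_ax.json.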


Environment

System description

NeuroPoly server, Rosenberg, Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-53-generic x86_64)

Installed packages

on branch mhb/1213-fix-3d-data-augmentation from PR 1222

Output of pip freeze

absl-py==1.1.0
astor==0.8.1
astunparse==1.6.3
awscli==1.22.34
beniget==0.4.1
bids-validator==1.9.9
botocore==1.23.34
brz-etckeeper==0.0.0
cachetools==5.2.0
certifi==2020.6.20
chardet==4.0.0
click==8.0.3
colorama==0.4.4
coloredlogs==15.0.1
command-not-found==0.3
commonmark==0.9.1
cryptography==3.4.8
csv-diff==1.1
cycler==0.11.0
dbus-python==1.2.18
decorator==4.4.2
Deprecated==1.2.13
dictdiffer==0.9.0
dill==0.3.5.1
distlib==0.3.4
distro==1.7.0
distro-info===1.1build1
dnspython==2.1.0
docker-pycreds==0.4.0
docopt==0.6.2
docutils==0.17.1
filelock==3.6.0
flatbuffers==2.0.7
fonttools==4.33.3
formulaic==0.3.4
fsleyes==1.5.0
fsleyes-props==1.8.2
fsleyes-widgets==0.12.3
fslpy==3.9.5
gast==0.4.0
gitdb==4.0.10
GitPython==3.1.29
google-auth==2.8.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
gpg===1.16.0-unknown
grpcio==1.47.0
h5py==3.7.0
humanfriendly==10.0
humanize==4.4.0
idna==3.3
imageio==2.22.4
importlib-metadata==4.6.4
interface-meta==1.3.0
iotop==0.6
-e git+https://github.com/ivadomed/ivadomed.git@d6385f1c57b7433a57003167c215f2288db3b631#egg=ivadomed
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.2.0
keras==2.11.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.3
libclang==14.0.1
loguru==0.6.0
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.5.2
more-itertools==8.10.0
mpmath==1.2.1
netifaces==0.11.0
networkx==2.8.8
nibabel==3.2.2
num2words==0.5.12
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.0
onnxruntime==1.13.1
opt-einsum==3.3.0
osfclient==0.0.5
packaging==21.3
pandas==1.4.4
pathtools==0.1.2
Pillow==9.0.1
platformdirs==2.5.1
ply==3.11
promise==2.3
protobuf==3.19.4
psutil==5.9.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybids==0.15.5
Pygments==2.11.2
PyGObject==3.42.1
PyOpenGL==3.1.6
pyparsing==2.4.7
python-apt==2.3.0+ubuntu2.1
python-dateutil==2.8.1
pythran==0.10.0
pytz==2022.6
PyWavelets==1.4.1
PyYAML==5.4.1
requests==2.25.1
requests-oauthlib==1.3.1
requests-toolbelt==0.9.1
rich==12.6.0
roman==3.3
rsa==4.8
s3transfer==0.5.0
scikit-image==0.19.3
scikit-learn==1.2.0
scipy==1.8.0
screen-resolution-extra==0.0.0
seaborn==0.12.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shellingham==1.5.0
shortuuid==1.0.11
SimpleITK==2.2.1
six==1.16.0
smmap==5.0.0
SQLAlchemy==1.3.24
ssh-import-id==5.11
sympy==1.11.1
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.11.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
threadpoolctl==3.1.0
tifffile==2022.10.10
torch==1.11.0
torchaudio==0.13.0
torchio==0.18.86
torchvision==0.12.0
tqdm==4.64.0
typer==0.7.0
typing_extensions==4.2.0
ubuntu-drivers-common==0.0.0
ufw==0.36.1
unattended-upgrades==0.1
urllib3==1.26.13
virtualenv==20.13.0+ds
wandb==0.13.7
Werkzeug==2.1.2
wrapt==1.14.1
wxPython==4.0.7
xkit==0.0.0
zipp==1.0.0

Issue Analytics

Created 9 months ago
Comments: 5 (4 by maintainers)

Top GitHub Comments

2 reactions

jcohenadad commented, Dec 19, 2022

Still, if you want the live mode (which is useful), we need to figure out what is wrong in your config. I don’t think it’s a network issue because I’m using the same computer and I don’t experience this issue.

1 reaction

kiristern commented, Dec 19, 2022

Thanks for the suggestion @kanishk16. I’m no longer getting the error message after setting ‘mode’ = ‘offline’.
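
The thread does not show where ‘mode’ was set. One way to get the same effect without editing any code is the WANDB_MODE environment variable, which wandb reads when a run is initialised. A minimal sketch, assuming the training code does not pass its own mode argument to wandb.init (an explicit argument would normally take precedence); the helper name use_wandb_offline is hypothetical.

```python
import os


def use_wandb_offline(env=None):
    """Set WANDB_MODE=offline in the given mapping (default: this process's environment)."""
    env = os.environ if env is None else env
    env["WANDB_MODE"] = "offline"
    return env


if __name__ == "__main__":
    use_wandb_offline()
    # wandb runs started later in this process should log to local files only;
    # they can be uploaded afterwards with: wandb sync <run directory>
    print(os.environ["WANDB_MODE"])
```

Offline mode gives up the live dashboard mentioned in the comment above, so it works around the retry loop rather than explaining it.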