error while using "gmi" for the loss
Subject of the issue
Getting `tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: local_net/down_sample_resnet_block/conv3d_block/conv3d/conv3d_1/kernel_0 [Op:WriteHistogramSummary]`
while trying to use `gmi` in several scenarios (e.g. in the demos).
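For context: mutual-information losses are computed from a (soft) joint intensity histogram, and a common source of NaN in such losses is the `0 * log(0)` term from empty histogram bins when no smoothing floor is applied. This is a guess at the mechanism, not a confirmed diagnosis of DeepReg's `gmi` implementation; a minimal NumPy sketch of the failure mode:

```python
import math
import numpy as np

def mutual_information(joint_hist, eps=0.0):
    """Mutual information from a joint histogram.

    Illustrative only (not DeepReg code): with eps=0, any empty bin
    produces log(0) = -inf and 0 * -inf = nan; a small eps floor keeps
    the sum finite.
    """
    p_xy = joint_hist / joint_hist.sum()          # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)         # marginal over y
    p_y = p_xy.sum(axis=0, keepdims=True)         # marginal over x
    ratio = p_xy / (p_x * p_y + eps)
    return float(np.sum(p_xy * np.log(ratio + eps)))

# Perfectly correlated intensities -> off-diagonal bins are empty.
h = np.array([[4.0, 0.0], [0.0, 4.0]])
print(mutual_information(h))            # nan without a floor
print(mutual_information(h, eps=1e-7))  # ~log(2), finite
```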
If the bug is confirmed, would you be willing to submit a PR? (Help can be provided if you need assistance submitting a PR)
No
Your environment
- DeepReg version (commit hash or tag): 0.1.0b1 (from `git rev-parse HEAD`: 8b8d75fdaaf89be2dfefc1d5c3c37e3ef26fd7d1)
- OS: Linux 4.15.0-112-generic #113-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
- Python Version: 3.7.9
- TensorFlow: 2.2.0
Steps to reproduce
Modified the grouped_mr_heart demo yaml file to use `gmi` instead of `lncc`, then ran:
deepreg_train --gpu "3" --config_path demos/grouped_mr_heart/grouped_mr_heart.yaml --log_dir grouped_mr_heart
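The modification amounts to swapping the image-loss name in the demo config. The exact key nesting below is an assumption based on the DeepReg 0.1.0b1 demo configs and may differ slightly:

```yaml
train:
  loss:
    dissimilarity:
      image:
        name: "gmi"   # was "lncc"; key nesting is approximate
```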
Log:
1/9 [==>...........................] - ETA: 0s - loss/weighted_regularization: 0.0000e+00 - loss: nan - loss/weighted_image_dissimilarity: nan - loss/regularization: 0.0000e+00 - loss/image_dissimilarity: nan
2020-10-15 08:42:22.326944: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-10-15 08:42:22.330619: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216] GpuTracer has collected 0 callback api events and 0 activity events.
2020-10-15 08:42:22.349700: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22
2020-10-15 08:42:22.352329: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.trace.json.gz
2020-10-15 08:42:22.353773: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0.001 ms
2020-10-15 08:42:22.355437: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22
Dumped tool data for overview_page.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.overview_page.pb
Dumped tool data for input_pipeline.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs/grouped_mr_heart/train/plugins/profile/2020_10_15_08_42_22/MMIV-DGX-Station2.kernel_stats.pb
[progress-bar updates for steps 2/9 through 8/9 elided; all report loss: nan]
9/9 [==============================] - ETA: 0s - loss/weighted_regularization: nan - loss: nan - loss/weighted_image_dissimilarity: nan - loss/regularization: nan - loss/image_dissimilarity: nan
2020-10-15 08:42:34.992438: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at summary_kernels.cc:242 : Invalid argument: Nan in summary histogram for: local_net/down_sample_resnet_block/conv3d_block/conv3d/conv3d_1/kernel_0
Traceback (most recent call last):
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 464, in write_histogram_summary
tld.op_callbacks, writer, step, tag, values)
tensorflow.python.eager.core._FallbackException: This function does not handle the case of the path where all inputs are not already EagerTensors.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/charlie/anaconda3/envs/deepreg/bin/deepreg_train", line 33, in <module>
sys.exit(load_entry_point('deepreg', 'console_scripts', 'deepreg_train')())
File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 227, in main
log_dir=args.log_dir,
File "/home/charlie/3DREG-tests/DeepReg/deepreg/train.py", line 154, in train
callbacks=callbacks,
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 876, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 365, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 2000, in on_epoch_end
self._log_weights(epoch)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 2119, in _log_weights
summary_ops_v2.histogram(weight_name, weight, step=epoch)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 830, in histogram
return summary_writer_function(name, tensor, function, family=family)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 759, in summary_writer_function
should_record_summaries(), record, _nothing, name="")
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
return true_fn()
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 752, in record
with ops.control_dependencies([function(tag, scope)]):
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 828, in function
name=scope)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 469, in write_histogram_summary
writer, step, tag, values, name=name, ctx=_ctx)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 490, in write_histogram_summary_eager_fallback
attrs=_attrs, ctx=ctx, name=name)
File "/home/charlie/anaconda3/envs/deepreg/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: local_net/down_sample_resnet_block/conv3d_block/conv3d/conv3d_1/kernel_0 [Op:WriteHistogramSummary]
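Note that the crash itself comes from the TensorBoard weight-histogram callback: `WriteHistogramSummary` raises as soon as any logged weight tensor contains NaN, so the histogram error is a symptom of the diverged loss, not the root cause. A hedged, illustrative guard (plain NumPy, not DeepReg or Keras API) that drops non-finite tensors before they reach a histogram writer:

```python
import numpy as np

def finite_only(named_weights):
    """Keep only tensors whose entries are all finite.

    Illustrative sketch of a filter one could place in front of a
    TensorBoard-style histogram callback; names here are hypothetical.
    """
    return {
        name: w
        for name, w in named_weights.items()
        if np.isfinite(np.asarray(w)).all()
    }
```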
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 17 (11 by maintainers)
Top GitHub Comments
Hi @ciphercharly, the fix has been integrated into the `main` branch now, feel free to test again 😉 Please reopen this ticket if there's still an error!

tested quickly, seems to run without errors with custom model/data too 👍