
CUDNN_STATUS_INTERNAL_ERROR

See original GitHub issue

Describe the issue: /opt/deepvariant/bin/run_deepvariant crashes when it starts the GPU stage (call_variants).

Setup

  • google/deepvariant:0.10.0 Docker
  • Data: subset of Illumina resequencing data
  • GPU: GeForce RTX 2070 SUPER
  • $ nvcc --version
    nvcc: NVIDIA ® Cuda compiler driver
    Copyright © 2005-2017 NVIDIA Corporation
    Built on Fri_Nov__3_21:07:56_CDT_2017
    Cuda compilation tools, release 9.1, V9.1.85
  • $ nvidia-smi
    NVIDIA-SMI 450.51.06   Driver Version: 450.51.06   CUDA Version: 11.0

Workaround: Apparently the GPU module is consuming all my memory (8 GB); possibly config.gpu_options.allow_growth = True is not present in the script?
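For context, allow_growth is a standard TensorFlow session option rather than a DeepVariant-specific setting. The snippet below is a minimal standalone sketch of what it does, assuming the TF 1.x compat API shipped inside the container; it is illustrative only, not DeepVariant's actual session setup.

import tensorflow as tf

config = tf.compat.v1.ConfigProto()
# TF 1.x grabs nearly all GPU memory up front by default, which can leave
# cuDNN unable to create its handle (CUDNN_STATUS_INTERNAL_ERROR).
# allow_growth makes TensorFlow allocate GPU memory on demand instead.
config.gpu_options.allow_growth = True

with tf.compat.v1.Session(config=config) as sess:
    pass  # graph construction and sess.run(...) would go here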

Command line

BIN_VERSION="1.0.0"
BASE="${PWD}/deepvariant-run"
INPUT_DIR="${BASE}/input"
REF="10consensus.fasta"
REF2="reftst.fa"
BAM="268_041_m10.sorted.bam"
BAM2="tst.sorted.bam"
OUTPUT_DIR="${BASE}/output"
DATA_DIR="${INPUT_DIR}/data"
OUTPUT_VCF="M10.output.vcf.gz"
OUTPUT_VCF2="TST.output.vcf.gz"
OUTPUT_GVCF="M10.output.g.vcf.gz"
OUTPUT_GVCF2="TST.output.g.vcf.gz"

sudo docker run --gpus 1 \
  -v "${DATA_DIR}":"/input" \
  -v "${OUTPUT_DIR}:/output" \
  google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="/input/${REF2}" \
  --reads="/input/${BAM2}" \
  --output_vcf=/output/${OUTPUT_VCF} \
  --output_gvcf=/output/${OUTPUT_GVCF} \
  --intermediate_results_dir /output/intermediate_results_dir \
  --num_shards=30

Error trace

2020-09-24 03:47:35.386802: W third_party/nucleus/io/sam_reader.cc:534] Could not read base quality scores GWNJ-1012:204:GW191209000:1:1101:22544:2049: Not found: Could not read base quality scores
I0924 03:47:35.394492 139826099087104 make_examples.py:587] Task 28/30: Found 88 candidate variants
I0924 03:47:35.394706 139826099087104 make_examples.py:587] Task 28/30: Created 88 examples
I0924 03:47:35.416212 139915800631040 make_examples.py:587] Task 9/30: Found 74 candidate variants
I0924 03:47:35.416471 139915800631040 make_examples.py:587] Task 9/30: Created 76 examples
I0924 03:47:35.441959 139746083813120 make_examples.py:587] Task 29/30: Found 78 candidate variants
I0924 03:47:35.442209 139746083813120 make_examples.py:587] Task 29/30: Created 78 examples

real 0m5.429s user 2m1.568s sys 0m23.089s

***** Running the command:*****

time /opt/deepvariant/bin/call_variants --outfile "/output/intermediate_results_dir/call_variants_output.tfrecord.gz" --examples "/output/intermediate_results_dir/make_examples.tfrecord@30.gz" --checkpoint "/opt/models/wgs/model.ckpt"

I0924 03:47:37.408303 140325876573952 call_variants.py:335] Shape of input examples: [100, 221, 6] 2020-09-24 03:47:37.413854: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2020-09-24 03:47:37.437208: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz 2020-09-24 03:47:37.440001: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5e41920 executing computations on platform Host. Devices: 2020-09-24 03:47:37.440048: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version 2020-09-24 03:47:37.444991: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2020-09-24 03:47:37.554617: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5ea0f10 executing computations on platform CUDA. Devices: 2020-09-24 03:47:37.554679: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2070 SUPER, Compute Capability 7.5 2020-09-24 03:47:37.556109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: GeForce RTX 2070 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:21:00.0 2020-09-24 03:47:37.556612: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2020-09-24 03:47:37.559375: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2020-09-24 03:47:37.561650: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2020-09-24 03:47:37.562295: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2020-09-24 03:47:37.565509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2020-09-24 03:47:37.567974: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2020-09-24 03:47:37.574763: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-09-24 03:47:37.576204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2020-09-24 03:47:37.576265: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2020-09-24 03:47:37.577441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-09-24 03:47:37.577462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 2020-09-24 03:47:37.577470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N 2020-09-24 03:47:37.578993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6199 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:21:00.0, compute capability: 7.5) W0924 03:47:37.676500 140325876573952 estimator.py:1821] Using temporary folder as model directory: /tmp/tmp3gvrq0ei I0924 03:47:37.676881 140325876573952 estimator.py:212] Using config: {‘_model_dir’: ‘/tmp/tmp3gvrq0ei’, ‘_tf_random_seed’: None, ‘_save_summary_steps’: 100, ‘_save_checkpoints_steps’: None, 
‘_save_checkpoints_secs’: 600, ‘_session_config’: , ‘_keep_checkpoint_max’: 100000, ‘_keep_checkpoint_every_n_hours’: 10000, ‘_log_step_count_steps’: 100, ‘_train_distribute’: None, ‘_device_fn’: None, ‘_protocol’: None, ‘_eval_distribute’: None, ‘_experimental_distribute’: None, ‘_experimental_max_worker_delay_secs’: None, ‘_session_creation_timeout_secs’: 7200, ‘_service’: None, ‘_cluster_spec’: <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9f898d3630>, ‘_task_type’: ‘worker’, ‘_task_id’: 0, ‘_global_id_in_cluster’: 0, ‘_master’: ‘’, ‘_evaluation_master’: ‘’, ‘_is_chief’: True, ‘_num_ps_replicas’: 0, ‘_num_worker_replicas’: 1} I0924 03:47:37.677164 140325876573952 call_variants.py:426] Writing calls to /output/intermediate_results_dir/call_variants_output.tfrecord.gz W0924 03:47:37.681965 140325876573952 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version. Instructions for updating: If using Keras pass *_constraint arguments to layers. W0924 03:47:37.690693 140325876573952 deprecation.py:323] From /tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/data_providers.py:375: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_determinstic. W0924 03:47:37.814187 140325876573952 deprecation.py:323] From /tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/data_providers.py:381: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version. Instructions for updating: Use tf.data.Dataset.map(map_func, num_parallel_calls) followed by tf.data.Dataset.batch(batch_size, drop_remainder). Static tf.data optimizations will take care of using the fused implementation. I0924 03:47:38.164505 140325876573952 estimator.py:1147] Calling model_fn. W0924 03:47:38.168455 140325876573952 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tf_slim/layers/layers.py:1089: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version. Instructions for updating: Please use layer.__call__ method instead. I0924 03:47:41.667636 140325876573952 estimator.py:1149] Done calling model_fn. I0924 03:47:42.548214 140325876573952 monitored_session.py:240] Graph was finalized. 
2020-09-24 03:47:42.549039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: GeForce RTX 2070 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:21:00.0 2020-09-24 03:47:42.549107: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2020-09-24 03:47:42.549121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2020-09-24 03:47:42.549131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2020-09-24 03:47:42.549143: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2020-09-24 03:47:42.549151: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2020-09-24 03:47:42.549164: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2020-09-24 03:47:42.549174: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-09-24 03:47:42.549558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2020-09-24 03:47:42.549586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-09-24 03:47:42.549595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 2020-09-24 03:47:42.549601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N 2020-09-24 03:47:42.549975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6199 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:21:00.0, compute capability: 7.5) I0924 03:47:42.550738 140325876573952 saver.py:1284] Restoring parameters from /opt/models/wgs/model.ckpt I0924 03:47:43.702764 140325876573952 session_manager.py:500] Running local_init_op. I0924 03:47:43.766339 140325876573952 session_manager.py:502] Done running local_init_op. I0924 03:47:44.184749 140325876573952 modeling.py:415] Reloading EMA… I0924 03:47:44.185623 140325876573952 saver.py:1284] Restoring parameters from /opt/models/wgs/model.ckpt 2020-09-24 03:47:45.236844: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-09-24 03:47:45.652085: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR 2020-09-24 03:47:45.654628: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR Traceback (most recent call last): File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call return fn(*args) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn target_list, run_metadata) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found. (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. 
[[{{node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D}}]] (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[{{node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D}}]] [[softmax_tensor_1/_3035]] 0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File “/tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/call_variants.py”, line 491, in <module> tf.compat.v1.app.run() File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py”, line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File “/tmp/Bazel.runfiles_dgqnmzud/runfiles/absl_py/absl/app.py”, line 300, in run _run_main(main, args) File “/tmp/Bazel.runfiles_dgqnmzud/runfiles/absl_py/absl/app.py”, line 251, in _run_main sys.exit(main(argv)) File “/tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/call_variants.py”, line 481, in main use_tpu=FLAGS.use_tpu, File “/tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/call_variants.py”, line 433, in call_variants prediction = next(predictions) File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 640, in predict preds_evaluated = mon_sess.run(predictions) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 754, in run run_metadata=run_metadata) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1259, in run run_metadata=run_metadata) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1360, in run raise six.reraise(*original_exc_info) File “/tmp/Bazel.runfiles_dgqnmzud/runfiles/six_archive/six.py”, line 686, in reraise raise value File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1345, in run return self._sess.run(*args, **kwargs) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1418, in run run_metadata=run_metadata) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1176, in run return self._sess.run(*args, **kwargs) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 956, in run run_metadata_ptr) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1180, in _run feed_dict_tensor, options, run_metadata) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run run_metadata) File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found. (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]] (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]] [[softmax_tensor_1/_3035]] 0 successful operations. 0 derived errors ignored.

Original stack trace for ‘InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D’: File “tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/call_variants.py”, line 491, in <module> tf.compat.v1.app.run() File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py”, line 40, in run _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef) File “tmp/Bazel.runfiles_dgqnmzud/runfiles/absl_py/absl/app.py”, line 300, in run _run_main(main, args) File “tmp/Bazel.runfiles_dgqnmzud/runfiles/absl_py/absl/app.py”, line 251, in _run_main sys.exit(main(argv)) File “tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/call_variants.py”, line 481, in main use_tpu=FLAGS.use_tpu, File “tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/call_variants.py”, line 433, in call_variants prediction = next(predictions) File “usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 622, in predict features, None, ModeKeys.PREDICT, self.config) File “usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1148, in _call_model_fn model_fn_results = self._model_fn(features=features, **kwargs) File “tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/modeling.py”, line 914, in model_fn is_training=mode == tf.estimator.ModeKeys.TRAIN) File “tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/modeling.py”, line 744, in create return self._create(images, num_classes, is_training) File “tmp/Bazel.runfiles_dgqnmzud/runfiles/com_google_deepvariant/deepvariant/modeling.py”, line 1122, in _create images, num_classes, create_aux_logits=False, is_training=is_training) File “usr/local/lib/python3.6/dist-packages/tf_slim/nets/inception_v3.py”, line 587, in inception_v3 depth_multiplier=depth_multiplier) File “usr/local/lib/python3.6/dist-packages/tf_slim/nets/inception_v3.py”, line 117, in inception_v3_base net = layers.conv2d(inputs, depth(32), [3, 3], stride=2, scope=end_point) File “usr/local/lib/python3.6/dist-packages/tf_slim/ops/arg_scope.py”, line 184, in func_with_args return func(*args, **current_args) File “usr/local/lib/python3.6/dist-packages/tf_slim/layers/layers.py”, line 1191, in convolution2d conv_dims=2) File “usr/local/lib/python3.6/dist-packages/tf_slim/ops/arg_scope.py”, line 184, in func_with_args return func(*args, **current_args) File “usr/local/lib/python3.6/dist-packages/tf_slim/layers/layers.py”, line 1089, in convolution outputs = layer.apply(inputs) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 324, in new_func return func(*args, **kwargs) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py”, line 1695, in apply return self.call(inputs, *args, **kwargs) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py”, line 548, in call outputs = super(Layer, self).call(inputs, *args, **kwargs) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py”, line 847, in call outputs = call_fn(cast_inputs, *args, **kwargs) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py”, line 234, in wrapper return converted_call(f, options, args, kwargs) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py”, line 439, in converted_call return _call_unconverted(f, args, kwargs, options) File 
“usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py”, line 330, in _call_unconverted return f(*args, **kwargs) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py”, line 197, in call outputs = self._convolution_op(inputs, self.kernel) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py”, line 1134, in call return self.conv_op(inp, filter) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py”, line 639, in call return self.call(inp, filter) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py”, line 238, in call name=self.name) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py”, line 2010, in conv2d name=name) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py”, line 1071, in conv2d data_format=data_format, dilations=dilations, name=name) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py”, line 793, in _apply_op_helper op_def=op_def) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func return func(*args, **kwargs) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3360, in create_op attrs, op_def, compute_device) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 3429, in _create_op_internal op_def=op_def) File “usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py”, line 1751, in init self._traceback = tf_stack.extract_stack()

real    0m10.613s
user    0m11.112s
sys     0m4.718s

I0924 03:47:46.482943 140410383501056 run_deepvariant.py:364] None
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 369, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/opt/deepvariant/bin/run_deepvariant.py", line 362, in main
    subprocess.check_call(command, shell=True, executable='/bin/bash')
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'time /opt/deepvariant/bin/call_variants --outfile "/output/intermediate_results_dir/call_variants_output.tfrecord.gz" --examples "/output/intermediate_results_dir/make_examples.tfrecord@30.gz" --checkpoint "/opt/models/wgs/model.ckpt"' returned non-zero exit status 1.

Following my nvidia-smi output, it consumes all the memory; is there a way to limit memory?

Cheers.
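On the question of limiting memory: in TensorFlow 1.x the GPU footprint can also be capped to a fixed fraction instead of grown on demand. Again, this is a minimal standalone sketch with an arbitrary 0.7 cap, not code from DeepVariant:

import tensorflow as tf

config = tf.compat.v1.ConfigProto()
# Example value: let this process claim at most ~70% of the card's memory
# instead of all 8 GB; allocations beyond the cap fail rather than
# taking the whole card.
config.gpu_options.per_process_gpu_memory_fraction = 0.7

with tf.compat.v1.Session(config=config) as sess:
    pass  # model execution would go here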

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
leorippel commented, Oct 13, 2020

> Hi @leorippel, you can pass the desired argument to run_deepvariant.py using the command below:

sudo docker run --gpus 1 \
  -v "${DATA_DIR}":"/input" \
  -v "${OUTPUT_DIR}:/output" \
  google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="/input/${REF2}" \
  --reads="/input/${BAM2}" \
  --output_vcf=/output/${OUTPUT_VCF} \
  --output_gvcf=/output/${OUTPUT_GVCF} \
  --intermediate_results_dir /output/intermediate_results_dir \
  --num_shards=30 \
  --call_variants_extra_args="config_string='gpu_options: {allow_growth: True}'"

> Specifically, I added: --call_variants_extra_args="config_string='gpu_options: {allow_growth: True}'"
>
> Does adding this fix the memory issue?

Yes, it fixed the issue. Thanks!

0 reactions
leorippel commented, Oct 13, 2020

> @leorippel got it, thanks for the clarification! I'm not sure what the issue is in this case, but if you are able to use a machine with more memory, that would be the easiest option. Another option is to shard your data by chromosome and run DeepVariant separately for each chromosome (or groups of chromosomes).

I took the intermediate results and ran them on a bigger machine. It worked. The memory consumption rose to 435 GB (!), by far the hungriest step. Thanks.
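If a larger machine is not available, the maintainer's per-chromosome suggestion could be scripted roughly as follows. This is only a sketch: the chromosome names and paths are placeholders, and relying on run_deepvariant's --regions flag to restrict each run is an assumption, not a verified recipe from the DeepVariant docs.

import subprocess

# Hypothetical chromosome names and paths; adjust to your reference and data layout.
chromosomes = ["chr1", "chr2", "chr3"]
data_dir = "/path/to/input"     # placeholder
output_dir = "/path/to/output"  # placeholder

for chrom in chromosomes:
    subprocess.check_call([
        "sudo", "docker", "run", "--gpus", "1",
        "-v", f"{data_dir}:/input",
        "-v", f"{output_dir}:/output",
        "google/deepvariant:1.0.0-gpu",
        "/opt/deepvariant/bin/run_deepvariant",
        "--model_type=WGS",
        "--ref=/input/reftst.fa",
        "--reads=/input/tst.sorted.bam",
        f"--regions={chrom}",  # assumed flag; restricts calling to one chromosome
        f"--output_vcf=/output/{chrom}.output.vcf.gz",
        "--num_shards=30",
    ])

The per-chromosome VCFs would then need to be merged afterwards, for example with bcftools concat.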

