
Issue with singularity gpu


Hello,

I’m trying to debug my installation of the GPU version of DeepVariant under singularity, on a new C4140 GPU node with Tesla V100s. I’ve run the CPU version successfully in production and am very happy with it, but the shift to GPU is giving me trouble; I’m likely running into an issue with CUDA or TensorFlow.

I have several CUDA modules loaded, but perhaps I’m missing one of the key libraries? I also have TensorFlow in a conda environment, although that dependency is probably already satisfied inside the singularity image.

Here’s the code I’m running from the Quickstart:

OUTPUT_DIR="${PWD}/quickstart-output"
INPUT_DIR="${PWD}/quickstart-testdata"
mkdir -p "${OUTPUT_DIR}"

BIN_VERSION="1.3.0"

# Load modules
module load singularity
module load cuda-dcgm/2.2.9.1
module load cuda11.4/toolkit
module load cuda11.4/blas
module load cuda11.4/nsight
module load cuda11.4/profiler
module load cuda11.4/fft
source /mnt/common/Precision/Miniconda3/opt/miniconda3/etc/profile.d/conda.sh
conda activate TensorFlow_GPU

# Pull the image.
singularity pull docker://google/deepvariant:"${BIN_VERSION}-gpu"


# Run
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
  --nv \
  docker://google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="${INPUT_DIR}"/ucsc.hg19.chr20.unittest.fasta \
  --reads="${INPUT_DIR}"/NA12878_S1.chr20.10_10p1mb.bam \
  --regions "chr20:10,000,000-10,010,000" \
  --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
  --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
  --intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir"

And here’s my error:

2022-02-07 11:50:52.952780: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 48, in <module>
    import tensorflow as tf
  File "/home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages/tensorflow/__init__.py", line 444, in <module>
    _ll.load_library(_main_dir)
  File "/home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 154, in load_library
    py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.8/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb

I’m wondering if this traceback helps pinpoint the problem I’m experiencing?

Is there something I can run with CUDA to test that implementation on our new GPU server?
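A minimal GPU smoke test along these lines, sketched under the assumption that the `--nv` flag is supported by the installed singularity version and that the 1.3.0-gpu image ships a TF 2.x Python (both are assumptions, not confirmed from the image itself), would be:

```shell
# 1) Confirm the host NVIDIA driver is visible inside the container.
singularity exec --nv docker://google/deepvariant:"${BIN_VERSION}-gpu" nvidia-smi

# 2) Confirm the TensorFlow baked into the image can see the GPU.
#    -e cleans the host environment and -c contains the filesystem, so
#    packages under ~/.local cannot shadow the image's TensorFlow.
singularity exec --nv -e -c docker://google/deepvariant:"${BIN_VERSION}-gpu" \
  python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

If step 1 fails, the problem is driver/`--nv` passthrough; if step 1 works but step 2 prints an empty list, the problem is inside the TensorFlow/CUDA stack of the image or its environment.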

Thanks! Phil

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8

Top GitHub Comments

1 reaction
Phillip-a-richmond commented, Jun 15, 2022

I was able to get around this issue with my version of singularity (3.4.2) by cleaning the environment, limiting what gets passed into the container from the host, and setting the tmp dir explicitly inside the working directory on the NFS.

Here’s my code chunk:

WORKING_DIR=/mnt/scratch/Precision/Hub/PROCESS/DH4749/
export SINGULARITY_CACHEDIR=$WORKING_DIR
export SINGULARITY_TMPDIR=$WORKING_DIR/tmp/
mkdir -p $WORKING_DIR/tmp/

singularity exec \
	-e \
	-c \
	-H $WORKING_DIR \
	-B $WORKING_DIR/tmp:/tmp \
	-B /usr/lib/locale/:/usr/lib/locale/ \
	-B "${BAM_DIR}":"/bamdir" \
	-B "${FASTA_DIR}":"/genomedir" \
	-B "${OUTPUT_DIR}":"/output" \
	docker://google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WES \
  --ref="/genomedir/$FASTA_FILE" \
  --reads="/bamdir/$PROBAND_BAM" \
  --output_vcf="/output/$PROBAND_VCF" \
  --output_gvcf="/output/$PROBAND_GVCF" \
  --intermediate_results_dir="/output/intermediate" \
  --num_shards=$NSLOTS 

With newer versions of singularity, I believe fewer environment variables are passed into the container by default, including PYTHONPATH, which otherwise pulls in packages from the home directory and /usr/local/src. That is why you couldn’t reproduce the error on a fresh cloud deployment.
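The shadowing mechanism behind the original `undefined symbol` traceback can be illustrated in plain Python (this is a generic sketch of how user-site packages take precedence, not DeepVariant-specific): import resolution is first-match-wins over `sys.path`, and the per-user site directory under `~/.local` normally precedes the system `dist-packages`, so a host-installed TensorFlow built against different headers shadows the one baked into the image.

```python
import site
import sys

# Whether Python will consult the per-user site directory (~/.local/...).
# Setting PYTHONNOUSERSITE=1 before launching Python disables it, which is
# one way to keep host-installed packages out of a container run.
print("user site enabled:", site.ENABLE_USER_SITE)

user_site = site.getusersitepackages()
print("user site dir:", user_site)

# First match wins: if user_site appears on sys.path before the system
# dist-packages, "import tensorflow" loads the host copy, and its compiled
# kernels (e.g. libtfkernel_sobol_op.so) can reference symbols the
# container's TensorFlow runtime does not export.
for path in sys.path:
    print(path)
```

Running singularity with `-e` (clean environment) and `-c` (contained filesystem, so the host home directory and its `~/.local` packages are not mounted) removes both routes by which the host copy can win.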

Feel free to keep this closed; I just figured it out on my end. It may be useful to someone hitting the same issue on a shared HPC system with an older singularity version.

0 reactions
pichuan commented, Feb 14, 2022

Hi @Phillip-a-richmond , if you have any suggestions on how to reproduce this issue, please let me know. I’ll close this for now.
