parse_sam_aux_fields conflict ValueError
Dear developers! I am very curious to use DeepVariant on our in-house data. In trying to do so, I stumbled upon an error I cannot seem to circumvent.
Problem: I am trying to run DeepVariant on a BAM file that originated from PacBio LAA output, mapped with minimap2. I receive the error that it is unable to read any records. I also got the warGning (lol!) that --add_hp_channel is set but --parse_sam_aux_fields is not.
Initial command: sudo docker run -v "2021-05-11_deepvariant_PB":"/input/" -v "2021-05-11_deepvariant_PB/output_DV":"/output/" google/deepvariant:"1.1.0" /opt/deepvariant/bin/run_deepvariant --model_type=PACBIO --ref=/input/ref.fasta --reads=/input/R9_Z-1707-003_cluster1_RC492.bam --output_vcf=/output/output.vcf.gz
What I tried:
I tried to rerun with the following extra argument: --make_examples_extra_args="parse_sam_aux_fields=true".
This gives me a ValueError from run_deepvariant.py saying that it conflicts with the sort_by_haplotypes flag, even though I didn't use it. Then I tried to add both arguments: --make_examples_extra_args="sort_by_haplotypes=false,parse_sam_aux_fields=true", but this gives the same ValueError.
ValueError: The extra_args "parse_sam_aux_fields" conflicts with other flags. Please fix and try again. Starting in v1.1.0, if you are running with PACBIO and want to use HP tags, please use the new --use_hp_information flag instead of using --make_examples_extra_args="sort_by_haplotypes=true,parse_sam_aux_fields=true"
I also tried to run the command with --sample_name=Z-1707-003_cluster1_RC492_phase0 (the RG for the bamfile), which does not give the warning anymore, but still leaves me with an empty vcf.
Tool stderr for the initial command:
I0511 12:24:29.658635 140614860437248 run_deepvariant.py:317] Re-using the directory for intermediate results in /tmp/tmpq5tvks3j
***** Intermediate results will be written to /tmp/tmpq5tvks3j in docker. ****
***** Running the command:*****
( time seq 0 0 | parallel -q --halt 2 --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "/input/ref.fasta" --reads "/input/R9_Z-1707-003_cluster1_RC492.bam" --examples "/tmp/tmpq5tvks3j/make_examples.tfrecord@1.gz" --add_hp_channel --alt_aligned_pileup "diff_channels" --noparse_sam_aux_fields --norealign_reads --nosort_by_haplotypes --vsc_min_fraction_indels "0.12" --task {} )
I0511 12:24:31.945842 140409179444992 genomics_reader.py:223] Reading /input/R9_Z-1707-003_cluster1_RC492.bam with NativeSamReader
W0511 12:24:31.946794 140409179444992 make_examples.py:589] WARGNING! --add_hp_channel is set but --parse_sam_aux_fields is not set. This will cause aux fields to not be read in. The relevant values might be zero. For example, for --add_hp_channel, resulting in an empty
HP channel. If this is not what you intended, please stop and enable --parse_sam_aux_fields.
I0511 12:24:32.430390 140409179444992 make_examples.py:648] Preparing inputs
I0511 12:24:32.438421 140409179444992 genomics_reader.py:223] Reading /input/R9_Z-1707-003_cluster1_RC492.bam with NativeSamReader
I0511 12:24:32.440476 140409179444992 make_examples.py:648] Common contigs are ['T86']
I0511 12:24:32.442919 140409179444992 make_examples.py:648] Starting from v0.9.0, --use_ref_for_cram is default to true. If you are using CRAM input, note that we will decode CRAM using the reference you passed in with --ref
2021-05-11 12:24:32.443393: I third_party/nucleus/io/sam_reader.cc:662] Setting HTS_OPT_BLOCK_SIZE to 134217728
I0511 12:24:32.447968 140409179444992 genomics_reader.py:223] Reading /input/R9_Z-1707-003_cluster1_RC492.bam with NativeSamReader
I0511 12:24:32.453339 140409179444992 genomics_reader.py:223] Reading /input/R9_Z-1707-003_cluster1_RC492.bam with NativeSamReader
I0511 12:24:32.579413 140409179444992 make_examples.py:648] Writing examples to /tmp/tmpq5tvks3j/make_examples.tfrecord-00000-of-00001.gz
I0511 12:24:32.579596 140409179444992 make_examples.py:648] Overhead for preparing inputs: 0 seconds
I0511 12:24:32.587054 140409179444992 make_examples.py:648] 0 candidates (0 examples) [0.01s elapsed]
I0511 12:24:32.591045 140409179444992 make_examples.py:648] Found 0 candidate variants
I0511 12:24:32.591111 140409179444992 make_examples.py:648] Created 0 examples
real 0m3.165s
user 0m3.133s
sys 0m1.450s
***** Running the command:*****
( time /opt/deepvariant/bin/call_variants --outfile "/tmp/tmpq5tvks3j/call_variants_output.tfrecord.gz" --examples "/tmp/tmpq5tvks3j/make_examples.tfrecord@1.gz" --checkpoint "/opt/models/pacbio/model.ckpt" )
W0511 12:24:34.935784 140411820246784 call_variants.py:327] Unable to read any records from /tmp/tmpq5tvks3j/make_examples.tfrecord@1.gz. Output will contain zero records.
real 0m2.355s
user 0m2.789s
sys 0m1.594s
***** Running the command:*****
( time /opt/deepvariant/bin/postprocess_variants --ref "/input/ref.fasta" --infile "/tmp/tmpq5tvks3j/call_variants_output.tfrecord.gz" --outfile "/output/output.vcf.gz" )
I0511 12:24:37.234371 139970945300224 postprocess_variants.py:1083] Could not determine sample name and --sample_name is unset. Using the default sample name. Sample name: default
I0511 12:24:37.235468 139970945300224 postprocess_variants.py:1111] call_variants_output is empty. Writing out empty VCF.
I0511 12:24:37.235656 139970945300224 postprocess_variants.py:1139] Writing variants to VCF.
I0511 12:24:37.235709 139970945300224 postprocess_variants.py:723] Writing output to VCF file: /output/output.vcf.gz
I0511 12:24:37.236480 139970945300224 genomics_writer.py:176] Writing /output/output.vcf.gz with NativeVcfWriter
I0511 12:24:37.237797 139970945300224 postprocess_variants.py:1147] VCF creation took 3.563165664672851e-05 minutes
I0511 12:24:37.239083 139970945300224 genomics_reader.py:223] Reading /output/output.vcf.gz with NativeVcfReader
real 0m2.472s
user 0m2.962s
sys 0m1.380
Thanks a lot!!
Top GitHub Comments
Happy to help! For your question, it depends on how low the coverage is. You can see this blog post for how coverage impacts accuracy: https://google.github.io/deepvariant/posts/2019-09-10-twenty-is-the-new-thirty-comparing-current-and-historical-wgs-accuracy-across-coverage/
Hi @annabeldekker
I’ll paste some similar information from my answer in the other issue: https://github.com/google/deepvariant/issues/458#issuecomment-844317545. Hopefully my answer below will help you as well:
Starting from v1.1.0, we added an additional channel to our PacBio model, and tried to simplify the flags in the one-step run_deepvariant by adding just one flag, --use_hp_information, which you can set to false if your BAM is not phased, and set to true if your BAM is phased.
Example: https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-pacbio-model-case-study.md#run-deepvariant-on-haplotagged-chromosome-20-alignments
This --use_hp_information flag in the one-step run_deepvariant command actually controls both sort_by_haplotypes and parse_sam_aux_fields in the make_examples stage. If you set --use_hp_information to true in the one-step run_deepvariant command, sort_by_haplotypes and parse_sam_aux_fields are both set to true in the make_examples stage. If you set --use_hp_information to false, both are set to false in the make_examples stage.
In both cases, if you're running for PacBio, you always have to set --add_hp_channel to true in the make_examples stage to make sure the last channel is added. (If you're using the one-step run_deepvariant command, --add_hp_channel is automatically added.)
We tried our best to encapsulate these 3 flags into just one --use_hp_information in our one-step run_deepvariant command. However, I understand this might have caused further confusion when people tried to use the make_examples binary on its own. You can find the logic here: https://github.com/google/deepvariant/blob/r1.1/scripts/run_deepvariant.py#L240-L242
I will try to update our deepvariant-pacbio-model-case-study.md file to document this.
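To make the flag relationship above easier to follow, here is a minimal Python sketch of how a single use_hp_information value could expand into the three make_examples flags for the PacBio model, and why passing sort_by_haplotypes or parse_sam_aux_fields through --make_examples_extra_args is rejected. This is only an illustration of the behavior described in this thread, not the actual code in run_deepvariant.py; the function name pacbio_make_examples_flags and the constant HP_CONTROLLED_FLAGS are hypothetical, and the error text is paraphrased.

```python
from typing import Dict, Optional

# Hypothetical names for illustration only; not DeepVariant's real code.
# These two make_examples flags are controlled by --use_hp_information.
HP_CONTROLLED_FLAGS = ("sort_by_haplotypes", "parse_sam_aux_fields")


def pacbio_make_examples_flags(
    use_hp_information: bool,
    extra_args: Optional[Dict[str, bool]] = None,
) -> Dict[str, bool]:
    """Expand one use_hp_information value into the three PacBio flags."""
    extra_args = dict(extra_args or {})

    # Any attempt to set the controlled flags directly is treated as a
    # conflict, regardless of the value (true or false) that was passed.
    conflicts = [name for name in HP_CONTROLLED_FLAGS if name in extra_args]
    if conflicts:
        raise ValueError(
            "extra_args %s conflict with use_hp_information; "
            "set use_hp_information instead" % conflicts
        )

    flags = {
        # Always on for PacBio so the last (HP) channel is added.
        "add_hp_channel": True,
        # Both follow use_hp_information: true for phased BAMs, else false.
        "sort_by_haplotypes": use_hp_information,
        "parse_sam_aux_fields": use_hp_information,
    }
    flags.update(extra_args)  # Unrelated extra args pass through unchanged.
    return flags
```

Under these assumptions, pacbio_make_examples_flags(False) leaves add_hp_channel on with the other two flags off, which matches the --add_hp_channel --noparse_sam_aux_fields --nosort_by_haplotypes combination in the make_examples command line logged above for an unphased BAM. Calling it with extra_args={"parse_sam_aux_fields": True} raises a ValueError, mirroring the conflict reported in the question.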