Varying the channels used to call variants
Describe the issue:
I previously used the following PopVCF model.ckpt with run_deepvariant v1.1 while including a PopVCF channel during make_examples. However, that model does not include a channel for insert_size, as their work predates v1.4.
With the default extra channel for 'insert_size' in v1.4, and make_examples having numerous options to include additional channels:
--[no]use_allele_frequency: If True, add another channel for pileup images to represent allele frequency information gathered from population call sets.
(default: 'false')
--[no]add_hp_channel: If true, add another channel to represent HP tags per read.
(default: 'false')
--channels: Comma-delimited list of optional channels to add. Available Channels: read_mapping_percent,avg_base_quality,identity,gap_compressed_identity,gc_content,is_homopolymer,homopolymer_weighted,blank,insert_size
Are there model.ckpt files for these channel options available somewhere to provide to call_variants via:
--checkpoint: Required. Path to the TensorFlow model checkpoint to use to evaluate candidate variant calls.
If so, do they include one additional channel or permutations of multiple channels?
If not, is there an alternative way to have run_deepvariant use different channels than what the default checkpoint contains during call_variants? For example, I am currently unable to include both insert_size and allele_frequency with v1.4.
Setup
- Operating system:
- DeepVariant version: v1.4
- Installation method (Docker, built from source, etc.): Singularity
- Type of data: WGS
Steps to reproduce:
- Command:
time singularity run -B '/usr/lib/locale/:/usr/lib/locale/,/path/to/region_files/:/region_dir/,/path/to/container/deep-variant/:/run_dir/,/path/to/output/:/path/to/reference_genome/:/ref_dir/,/path/to/bam_files/:/bam_dir/,/path/to/population_vcf/:/popVCF_dir/'
deepvariant_1.4.0.sif
/opt/deepvariant/bin/run_deepvariant
--model_type=WGS
--ref='/ref_dir/reference.fa'
--reads='/bam_dir/id.bam'
--output_vcf='/out_dir/test1.vcf.gz'
--intermediate_results_dir='/out_dir/tmp/test1/'
--num_shards='39'
--make_examples_extra_args="use_allele_frequency=true,population_vcfs=/popVCF_dir/UMAG1.POP.FREQ.vcf.gz"
--regions=/region_dir/regions_to_test.bed
- Error trace: (if applicable)
***** Running the command:*****
time /opt/deepvariant/bin/call_variants --outfile "/out_dir/tmp/test1/call_variants_output.tfrecord.gz" --examples "/out_dir/tmp/test1/make_examples.tfrecord@39.gz" --checkpoint "/opt/models/wgs/model.ckpt" --openvino_model_dir "/out_dir/tmp/test1/"
I0919 17:19:47.185331 46912500266816 call_variants.py:317] From /out_dir/tmp/test1/make_examples.tfrecord-00000-of-00039.gz.example_info.json: Shape of input examples: [100, 221, 8], Channels of input examples: [1, 2, 3, 4, 5, 6, 8, 19].
Traceback (most recent call last):
File "/tmp/Bazel.runfiles_l3__pco1/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 513, in <module>
tf.compat.v1.app.run()
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/tmp/Bazel.runfiles_l3__pco1/runfiles/absl_py/absl/app.py", line 300, in run
_run_main(main, args)
File "/tmp/Bazel.runfiles_l3__pco1/runfiles/absl_py/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/tmp/Bazel.runfiles_l3__pco1/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 494, in main
call_variants(
File "/tmp/Bazel.runfiles_l3__pco1/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 363, in call_variants
raise ValueError('The number of channels in examples and checkpoint '
ValueError: The number of channels in examples and checkpoint should match, but the checkpoint has 7 channels while the examples have 8.
real 0m3.217s
user 0m4.066s
sys 0m4.174s
real 77m45.059s
user 2960m49.979s
sys 39m40.911s
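The ValueError above comes down to comparing two channel lists. The sketch below shows that comparison in isolation. The example channels are copied from the call_variants log line; the model channel list is an assumption (the log only states that the checkpoint has 7 channels), and `channel_mismatch` is a hypothetical helper, not part of DeepVariant.

```python
# Channel enums of the examples, copied from the call_variants log above.
EXAMPLE_CHANNELS = [1, 2, 3, 4, 5, 6, 8, 19]

# ASSUMPTION: the default v1.4 WGS checkpoint uses the same channels minus
# enum 8. The log only confirms that the checkpoint has 7 channels.
ASSUMED_MODEL_CHANNELS = [1, 2, 3, 4, 5, 6, 19]


def channel_mismatch(example_channels, model_channels):
    """Return (extra, missing) channel enums relative to the model."""
    extra = sorted(set(example_channels) - set(model_channels))
    missing = sorted(set(model_channels) - set(example_channels))
    return extra, missing


extra, missing = channel_mismatch(EXAMPLE_CHANNELS, ASSUMED_MODEL_CHANNELS)
print(f"examples: {len(EXAMPLE_CHANNELS)} channels, "
      f"model: {len(ASSUMED_MODEL_CHANNELS)} channels")
print(f"in examples but not in model: {extra}")
print(f"in model but not in examples: {missing}")
```

Under that assumption, the only channel the examples carry that the checkpoint does not is enum 8, which is the channel that appeared once use_allele_frequency=true was passed to make_examples.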
Top GitHub Comments
Hi @jkalleberg, please see: https://gist.github.com/pichuan/7ad09bf1fa8f519facf6806eca835ea6
I’ll close this issue for now. Feel free to open more issues if you have any questions or feedback for us.
New model checkpoints associated with new releases will be under gs://deepvariant/models/DeepVariant as you noticed.
I mentioned that starting from v1.4.0, you can see this file:
The “channels” values are enums. You can look them up in this proto: https://github.com/google/deepvariant/blob/r1.4/deepvariant/protos/deepvariant.proto#L1048
From the example above, it’s saying that DeepVariant v1.4.0 WGS model has 7 channels, and they are:
Note that the allele frequency model isn’t part of our regular release process yet. It was made public as part of our preprint https://doi.org/10.1101/2021.01.06.425550. Right now, we’re retraining it when users request it. We’re certainly hoping to see more use cases (thank you for letting us know!). If it becomes more mature, we can consider building it into our regular release process. (Adding more regular support also means more overhead for each release, so we need to balance this carefully.)
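To see which channels a set of examples carries before running call_variants, the example_info.json file written next to each make_examples shard (the file named in the log in the issue) can be read directly. This is a minimal sketch with several assumptions: the key names "shape" and "channels" are assumed from the wording of the log line, the two labels are inferred only from this thread (enum 8 appeared when use_allele_frequency was enabled, and 19 is taken to be the default insert_size channel), and `load_example_info` and `describe_channels` are hypothetical helpers. The linked proto is the authoritative place to look up enum values.

```python
import json
import tempfile
from pathlib import Path

# Labels inferred from this thread only; check the linked proto for the
# authoritative enum definitions.
KNOWN_LABELS = {8: "allele_frequency", 19: "insert_size"}


def load_example_info(path):
    """Return (shape, channels) from an example_info.json-style file."""
    with open(path) as handle:
        info = json.load(handle)
    # ASSUMPTION: the file stores these values under "shape" and "channels".
    return info["shape"], info["channels"]


def describe_channels(channels):
    """Label the enums we can name; leave the rest as raw enum numbers."""
    return [KNOWN_LABELS.get(c, f"enum {c}") for c in channels]


# Demo on a stand-in file holding the values reported in the issue's log.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.example_info.json"
    demo_path.write_text(
        json.dumps({"shape": [100, 221, 8],
                    "channels": [1, 2, 3, 4, 5, 6, 8, 19]})
    )
    demo_shape, demo_channels = load_example_info(demo_path)

demo_labels = describe_channels(demo_channels)
print(demo_shape)
print(demo_labels)
```

Pointing `load_example_info` at a real shard's example_info.json would show whether the last dimension of the shape matches the number of channels the chosen checkpoint expects.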