How to run/call call_variants when make_examples produces sharded outputs
Hi, I’ve gotten a test run successfully through the make_examples step with 64 shards, and have produced 64 examples and gvcf files like so:
-rw-r--r-- 1 root root 14394035 Feb 6 18:18 test.examples.tfrecord-00000-of-00064.gz
-rw-r--r-- 1 root root 16089657 Feb 6 18:18 test.examples.tfrecord-00001-of-00064.gz
-rw-r--r-- 1 root root 14238866 Feb 6 18:18 test.examples.tfrecord-00002-of-00064.gz
-rw-r--r-- 1 root root 14484530 Feb 6 18:19 test.examples.tfrecord-00003-of-00064.gz
...
-rw-r--r-- 1 root root 15225527 Feb 6 18:18 test.examples.tfrecord-00056-of-00064.gz
-rw-r--r-- 1 root root 14663343 Feb 6 18:19 test.examples.tfrecord-00057-of-00064.gz
-rw-r--r-- 1 root root 14571664 Feb 6 18:19 test.examples.tfrecord-00058-of-00064.gz
-rw-r--r-- 1 root root 13704439 Feb 6 18:19 test.examples.tfrecord-00059-of-00064.gz
-rw-r--r-- 1 root root 14383355 Feb 6 18:18 test.examples.tfrecord-00060-of-00064.gz
-rw-r--r-- 1 root root 13559255 Feb 6 18:19 test.examples.tfrecord-00061-of-00064.gz
-rw-r--r-- 1 root root 16376740 Feb 6 18:19 test.examples.tfrecord-00062-of-00064.gz
-rw-r--r-- 1 root root 15276769 Feb 6 18:18 test.examples.tfrecord-00063-of-00064.gz
-rw-r--r-- 1 root root 5842718 Feb 6 18:18 test.gvcf.tfrecord-00000-of-00064.gz
-rw-r--r-- 1 root root 5860574 Feb 6 18:18 test.gvcf.tfrecord-00001-of-00064.gz
-rw-r--r-- 1 root root 5852289 Feb 6 18:18 test.gvcf.tfrecord-00002-of-00064.gz
-rw-r--r-- 1 root root 5845856 Feb 6 18:19 test.gvcf.tfrecord-00003-of-00064.gz
-rw-r--r-- 1 root root 5834861 Feb 6 18:18 test.gvcf.tfrecord-00004-of-00064.gz
-rw-r--r-- 1 root root 5812744 Feb 6 18:18 test.gvcf.tfrecord-00005-of-00064.gz
-rw-r--r-- 1 root root 5856643 Feb 6 18:19 test.gvcf.tfrecord-00006-of-00064.gz
...
-rw-r--r-- 1 root root 5893279 Feb 6 18:19 test.gvcf.tfrecord-00054-of-00064.gz
-rw-r--r-- 1 root root 5850799 Feb 6 18:19 test.gvcf.tfrecord-00055-of-00064.gz
-rw-r--r-- 1 root root 5844041 Feb 6 18:18 test.gvcf.tfrecord-00056-of-00064.gz
-rw-r--r-- 1 root root 5816735 Feb 6 18:19 test.gvcf.tfrecord-00057-of-00064.gz
-rw-r--r-- 1 root root 5852875 Feb 6 18:19 test.gvcf.tfrecord-00058-of-00064.gz
-rw-r--r-- 1 root root 5820441 Feb 6 18:19 test.gvcf.tfrecord-00059-of-00064.gz
-rw-r--r-- 1 root root 5797526 Feb 6 18:18 test.gvcf.tfrecord-00060-of-00064.gz
-rw-r--r-- 1 root root 5893496 Feb 6 18:19 test.gvcf.tfrecord-00061-of-00064.gz
-rw-r--r-- 1 root root 5818504 Feb 6 18:19 test.gvcf.tfrecord-00062-of-00064.gz
-rw-r--r-- 1 root root 5831798 Feb 6 18:18 test.gvcf.tfrecord-00063-of-00064.gz
Surprisingly, this was generated using the following command:
## Run `make_examples`
echo "Start running make_examples...Log will be in the terminal and also to make_examples.log."
( time seq 0 $((${numShards}-1)) | \
parallel -k --line-buffer \
/opt/deepvariant/bin/make_examples \
--mode calling \
--ref ${Fasta} \
--reads reads.bam \
--examples "${sample_id}.examples.tfrecord@${numShards}.gz" \
--gvcf "${sample_id}.gvcf.tfrecord@${numShards}.gz" \
--task {} \
) 2>&1 | tee "make_examples.log"
echo "Done."
echo
Which was based on this example: https://github.com/google/deepvariant/blob/r0.7/scripts/run_wgs_case_study_docker.sh
I would have expected the naming scheme to match the pattern I specified instead of the 000*-of-00064… strange.
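Judging only from the listing above, the `@64` in the spec appears to be shorthand for 64 shards, each written out as `-NNNNN-of-00064`. A minimal bash sketch of that mapping, assuming the pattern seen in the listing holds in general; `expand_sharded_spec` is a hypothetical helper for illustration, not a DeepVariant function:

```shell
#!/usr/bin/env bash
# Expand a sharded file spec such as "test.examples.tfrecord@64.gz" into the
# per-shard names seen in the listing (-00000-of-00064, -00001-of-00064, ...).
# expand_sharded_spec is a hypothetical helper, not part of DeepVariant.
expand_sharded_spec() {
  local spec="$1"
  # Split "<prefix>@<N><suffix>" into its three parts.
  if [[ "$spec" =~ ^(.*)@([0-9]+)(.*)$ ]]; then
    local prefix="${BASH_REMATCH[1]}"
    local suffix="${BASH_REMATCH[3]}"
    # Force base 10 so a count like "08" is not parsed as octal.
    local n=$((10#${BASH_REMATCH[2]}))
    local i
    for (( i = 0; i < n; i++ )); do
      printf '%s-%05d-of-%05d%s\n' "$prefix" "$i" "$n" "$suffix"
    done
  else
    # No "@N" present: the spec names a single file.
    printf '%s\n' "$spec"
  fi
}
```

Calling `expand_sharded_spec "test.examples.tfrecord@64.gz"` prints 64 names, starting with `test.examples.tfrecord-00000-of-00064.gz`, which matches the files in the listing.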
Now I am trying to move on to the next step, but I am again having trouble figuring out how to pass these sharded example files as inputs to the call_variants step.
In the example, it recommends:
## Run `call_variants`
echo "Start running call_variants...Log will be in the terminal and also to ${LOG_DIR}/call_variants.log."
( time sudo docker run \
-v "${BASE}":"${BASE}" \
gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/call_variants \
--outfile "${CALL_VARIANTS_OUTPUT}" \
--examples "${EXAMPLES}" \
--checkpoint "${MODEL}"
) 2>&1 | tee "${LOG_DIR}/call_variants.log"
echo "Done."
echo
Is there some magic pattern recognition that knows to look for files of the format 000*-of-00064? Confused as to how I should do this; should I run call_variants on 64 separate machines, with each machine running a job on one of the sharded make_examples outputs? When I try incorporating the code recommended in the example workflow, I get the following error:
ValueError: Cannot find matching files with the pattern "test.examples.tfrecord@64.gz"
So it is obviously not working out of the box as specified. But I’m not sure whether call_variants is intelligent enough to handle sharded examples, or whether I should explicitly run it once on each shard and then somehow merge all the VCFs afterwards. And where in this sharding would postprocessing of variants fit in to generate the VCF: can that be part of a reduce step pulling all sharded call_variants outputs together on one machine? Any recommendations @pichuan @akolesnikov ?
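One thing that can be checked before re-running is whether every shard implied by the spec is actually present in the directory where call_variants will look, for example inside the `${BASE}` directory mounted into the container. The sketch below does not establish the cause of the ValueError; it only lists the shard files that are absent. `missing_shards` is a hypothetical helper, and the `-NNNNN-of-NNNNN` naming is assumed from the listing above:

```shell
#!/usr/bin/env bash
# Print the shard files that a spec like "test.examples.tfrecord@64.gz"
# implies but that are absent from a directory (default: current directory).
# missing_shards is a hypothetical diagnostic helper, not part of DeepVariant.
missing_shards() {
  local spec="$1" dir="${2:-.}"
  if [[ ! "$spec" =~ ^(.*)@([0-9]+)(.*)$ ]]; then
    echo "not a sharded spec: $spec" >&2
    return 2
  fi
  local prefix="${BASH_REMATCH[1]}" suffix="${BASH_REMATCH[3]}"
  # Force base 10 so a count like "08" is not parsed as octal.
  local n=$((10#${BASH_REMATCH[2]}))
  local i name
  for (( i = 0; i < n; i++ )); do
    # Build the expected shard name, e.g. <prefix>-00003-of-00064<suffix>.
    printf -v name '%s-%05d-of-%05d%s' "$prefix" "$i" "$n" "$suffix"
    [[ -e "$dir/$name" ]] || printf '%s\n' "$name"
  done
}
```

Running `missing_shards "test.examples.tfrecord@64.gz" /path/to/output` (the path is a placeholder) prints nothing when all 64 files are present there, and one file name per missing shard otherwise.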
Top GitHub Comments
@ekofman Currently, the case studies (and corresponding scripts) are used to show an example of how to run DeepVariant. We showed an example of how to run it on a single machine, and didn’t focus on many other aspects such as how to pull multiple workers to orchestrate a distributed workflow, or how to run with GPU (which involves installing GPU driver, using the binaries that are built for GPU, etc). If you want to run on GPU, and if you have everything set up already (such as installing GPU driver correctly), you should be able to do it pretty much the same way. But instead of
sudo docker pull gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}"
, you’ll pull from gcr.io/deepvariant-docker/deepvariant_gpu, which is built for GPU. We have also documented it here: https://github.com/google/deepvariant/blob/r0.7/docs/deepvariant-details.md#call_variants in case you need to build the binaries yourself. Note that even though using GPUs is faster, the overall cost might not be better, depending on many other factors. Again, you can look at the GCP Cloud runner as an example of how they configure their run. If you end up doing more experiments to compare different configurations in your workflow, we would love to learn more about it as well. In addition to the GCP Cloud runner that @nmousavi's team maintains, we have also seen other examples such as https://github.com/atgenomix/deepvariant-on-spark (and their WGS case study reports run time as well).
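The image swap described above can be sketched as a small bash helper. It only builds the image name; `deepvariant_image` is a hypothetical function, and using the same `${BIN_VERSION}` tag for the GPU image is an assumption, not something the reply states:

```shell
#!/usr/bin/env bash
# Return the Docker image named in the reply above for a CPU or GPU run.
# deepvariant_image is a hypothetical helper, not part of DeepVariant.
# Assumption: the GPU image is published under the same version tag.
deepvariant_image() {
  local device="$1" version="$2"
  case "$device" in
    cpu) printf 'gcr.io/deepvariant-docker/deepvariant:%s\n' "$version" ;;
    gpu) printf 'gcr.io/deepvariant-docker/deepvariant_gpu:%s\n' "$version" ;;
    *)   echo "unknown device: $device" >&2; return 1 ;;
  esac
}

# Usage (needs Docker and network access, so it is left commented out):
#   sudo docker pull "$(deepvariant_image gpu "${BIN_VERSION}")"
```

Keeping the choice in one place means the rest of a script can stay identical between CPU and GPU runs, apart from whatever GPU driver setup the host needs.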
In terms of how much detail we include on the DeepVariant GitHub page: even though I’m personally very interested in the performance and cost of these implementations, I also need to consider the trade-off in the amount of detail we include, because too much information can also end up being confusing. If you have more suggestions on how to organize the documentation better in the future, please let me know. Even now it’s already a bit messy and I would like to simplify it further. Thanks!
@pichuan Thanks for such a thorough description, really appreciate it. Super helpful to understand the different expectations and possible failure modes of each step. I will go back and make sure all of the preconditions you mentioned are met. One last comment though: you didn’t run call_variants on a GPU, though it seems that is recommended. Does a GPU just make it faster, or is there any other reason to use one?