How to run/call call_variants when make_examples produces sharded outputs

Hi, I’ve gotten a test run successfully through the make_examples step with 64 shards, and it produced 64 examples files and 64 gvcf files like so:

-rw-r--r-- 1 root root   14394035 Feb  6 18:18 test.examples.tfrecord-00000-of-00064.gz
-rw-r--r-- 1 root root   16089657 Feb  6 18:18 test.examples.tfrecord-00001-of-00064.gz
-rw-r--r-- 1 root root   14238866 Feb  6 18:18 test.examples.tfrecord-00002-of-00064.gz
-rw-r--r-- 1 root root   14484530 Feb  6 18:19 test.examples.tfrecord-00003-of-00064.gz
...
-rw-r--r-- 1 root root   15225527 Feb  6 18:18 test.examples.tfrecord-00056-of-00064.gz
-rw-r--r-- 1 root root   14663343 Feb  6 18:19 test.examples.tfrecord-00057-of-00064.gz
-rw-r--r-- 1 root root   14571664 Feb  6 18:19 test.examples.tfrecord-00058-of-00064.gz
-rw-r--r-- 1 root root   13704439 Feb  6 18:19 test.examples.tfrecord-00059-of-00064.gz
-rw-r--r-- 1 root root   14383355 Feb  6 18:18 test.examples.tfrecord-00060-of-00064.gz
-rw-r--r-- 1 root root   13559255 Feb  6 18:19 test.examples.tfrecord-00061-of-00064.gz
-rw-r--r-- 1 root root   16376740 Feb  6 18:19 test.examples.tfrecord-00062-of-00064.gz
-rw-r--r-- 1 root root   15276769 Feb  6 18:18 test.examples.tfrecord-00063-of-00064.gz
-rw-r--r-- 1 root root    5842718 Feb  6 18:18 test.gvcf.tfrecord-00000-of-00064.gz
-rw-r--r-- 1 root root    5860574 Feb  6 18:18 test.gvcf.tfrecord-00001-of-00064.gz
-rw-r--r-- 1 root root    5852289 Feb  6 18:18 test.gvcf.tfrecord-00002-of-00064.gz
-rw-r--r-- 1 root root    5845856 Feb  6 18:19 test.gvcf.tfrecord-00003-of-00064.gz
-rw-r--r-- 1 root root    5834861 Feb  6 18:18 test.gvcf.tfrecord-00004-of-00064.gz
-rw-r--r-- 1 root root    5812744 Feb  6 18:18 test.gvcf.tfrecord-00005-of-00064.gz
-rw-r--r-- 1 root root    5856643 Feb  6 18:19 test.gvcf.tfrecord-00006-of-00064.gz
...
-rw-r--r-- 1 root root    5893279 Feb  6 18:19 test.gvcf.tfrecord-00054-of-00064.gz
-rw-r--r-- 1 root root    5850799 Feb  6 18:19 test.gvcf.tfrecord-00055-of-00064.gz
-rw-r--r-- 1 root root    5844041 Feb  6 18:18 test.gvcf.tfrecord-00056-of-00064.gz
-rw-r--r-- 1 root root    5816735 Feb  6 18:19 test.gvcf.tfrecord-00057-of-00064.gz
-rw-r--r-- 1 root root    5852875 Feb  6 18:19 test.gvcf.tfrecord-00058-of-00064.gz
-rw-r--r-- 1 root root    5820441 Feb  6 18:19 test.gvcf.tfrecord-00059-of-00064.gz
-rw-r--r-- 1 root root    5797526 Feb  6 18:18 test.gvcf.tfrecord-00060-of-00064.gz
-rw-r--r-- 1 root root    5893496 Feb  6 18:19 test.gvcf.tfrecord-00061-of-00064.gz
-rw-r--r-- 1 root root    5818504 Feb  6 18:19 test.gvcf.tfrecord-00062-of-00064.gz
-rw-r--r-- 1 root root    5831798 Feb  6 18:18 test.gvcf.tfrecord-00063-of-00064.gz

Surprisingly, this was generated using the following command:

## Run `make_examples`
echo "Start running make_examples...Log will be in the terminal and also to make_examples.log."
( time seq 0 $((numShards-1)) | \
  parallel -k --line-buffer \
    /opt/deepvariant/bin/make_examples \
    --mode calling \
    --ref "${Fasta}" \
    --reads reads.bam \
    --examples "${sample_id}.examples.tfrecord@${numShards}.gz" \
    --gvcf "${sample_id}.gvcf.tfrecord@${numShards}.gz" \
    --task {}
) 2>&1 | tee "make_examples.log"
echo "Done."
echo

Which was based on this example: https://github.com/google/deepvariant/blob/r0.7/scripts/run_wgs_case_study_docker.sh

I would have expected the output names to match the pattern I specified (test.examples.tfrecord@64.gz) rather than the 000NN-of-00064 form… strange.
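For reference, the 000NN-of-00064 names follow from the @64 in the output spec: DeepVariant treats name@N.ext as a sharded file spec, and each --task writes its own shard file. The sketch below only illustrates that naming convention as it appears in the listing above; expand_shard_spec is a hypothetical helper, not part of DeepVariant.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DeepVariant): print the per-shard file
# names that a sharded spec such as "test.examples.tfrecord@64.gz" refers to,
# assuming the "-NNNNN-of-NNNNN" naming seen in the directory listing.
expand_shard_spec() {
  local spec="$1"
  # Split "prefix@N<suffix>" into prefix, shard count, and optional suffix.
  if [[ "$spec" =~ ^(.*)@([0-9]+)(.*)$ ]]; then
    local prefix="${BASH_REMATCH[1]}"
    local n=$(( 10#${BASH_REMATCH[2]} ))   # force base 10 (avoid octal)
    local suffix="${BASH_REMATCH[3]}"
    local i
    for (( i = 0; i < n; i++ )); do
      printf '%s-%05d-of-%05d%s\n' "$prefix" "$i" "$n" "$suffix"
    done
  else
    # Not a sharded spec: treat it as a single plain file name.
    printf '%s\n' "$spec"
  fi
}

# Example with a small shard count to keep the output short.
expand_shard_spec "test.examples.tfrecord@3.gz"
```

With 64 in place of 3, the same function prints the 64 names from test.examples.tfrecord-00000-of-00064.gz through test.examples.tfrecord-00063-of-00064.gz.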

Now I am trying to move on to the next step, but I am again having trouble figuring out how to deal with these multiple example files (the sharding) when passing them as inputs to the call_variants step.

In the example, it recommends:

## Run `call_variants`
echo "Start running call_variants...Log will be in the terminal and also to ${LOG_DIR}/call_variants.log."
( time sudo docker run \
    -v "${BASE}":"${BASE}" \
    gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
    /opt/deepvariant/bin/call_variants \
    --outfile "${CALL_VARIANTS_OUTPUT}" \
    --examples "${EXAMPLES}" \
    --checkpoint "${MODEL}"
) 2>&1 | tee "${LOG_DIR}/call_variants.log"
echo "Done."
echo

Is there some magic pattern recognition that knows to look for files of the format 000*-of-00064? Confused as to how I should do this; should I run call_variants on 64 separate machines, with each machine running a job on one of the sharded make_examples outputs? When I try incorporating the code recommended in the example workflow, I get the following error:

ValueError: Cannot find matching files with the pattern "test.examples.tfrecord@64.gz"

So it is obviously not working out of the box as specified. But I’m not sure whether call_variants is intelligent enough to handle sharded examples, or whether I should explicitly run it once on each shard and then somehow merge all the VCFs afterwards. And where in this sharding would postprocessing of variants fit in to generate the VCF – can that be part of a reduce step that pulls all sharded call_variants outputs together on one machine? Any recommendations @pichuan @akolesnikov ?
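The ValueError above says that no files matched the sharded pattern at the path call_variants was given. The case-study script linked above appears to pass the whole sharded spec to a single call_variants run, so one thing worth ruling out is that the shard files are simply not visible under that exact prefix and directory (for example, from inside the container). A minimal sketch of such a check, assuming the file naming shown in the listing above; missing_shards is a hypothetical helper, not a DeepVariant tool.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic (not part of DeepVariant): report how many shard
# files are missing in a directory for a spec like "name@64.gz".
# Prints each missing path to stderr and the missing count to stdout.
missing_shards() {
  local dir="$1" spec="$2"
  if [[ "$spec" != *@[0-9]* ]]; then
    echo "not a sharded spec: $spec" >&2
    return 2
  fi
  local prefix="${spec%@*}"        # text before the '@'
  local rest="${spec##*@}"         # e.g. "64.gz"
  local digits="${rest%%[!0-9]*}"  # leading digits, e.g. "64"
  local suffix="${rest#"$digits"}" # e.g. ".gz"
  local n=$(( 10#$digits ))        # force base 10
  local missing=0 i f
  for (( i = 0; i < n; i++ )); do
    printf -v f '%s/%s-%05d-of-%05d%s' "$dir" "$prefix" "$i" "$n" "$suffix"
    if [[ ! -f "$f" ]]; then
      echo "missing: $f" >&2
      missing=$(( missing + 1 ))
    fi
  done
  echo "$missing"
}
```

Running missing_shards . "test.examples.tfrecord@64.gz" from the directory that call_variants will see should print 0 if all 64 shards are in place. A nonzero count points to a path or prefix mismatch rather than a problem with sharding itself.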

Top GitHub Comments

pichuan commented, Feb 7, 2019

@ekofman Currently, the case studies (and corresponding scripts) are meant to show an example of how to run DeepVariant. We showed how to run it on a single machine and didn’t focus on other aspects, such as how to coordinate multiple workers in a distributed workflow or how to run with a GPU (which involves installing the GPU driver, using the binaries that are built for GPU, etc.). If you want to run on a GPU and already have everything set up (such as the GPU driver installed correctly), you should be able to do it in much the same way. But instead of sudo docker pull gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}", you’ll pull from gcr.io/deepvariant-docker/deepvariant_gpu, which is built for GPU. We have also documented it here: https://github.com/google/deepvariant/blob/r0.7/docs/deepvariant-details.md#call_variants in case you need to build the binaries yourself.
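As a small illustration of the image switch described above, the sketch below picks the image name for a CPU or GPU run. deepvariant_image is a hypothetical helper, the version value is only a placeholder, and it assumes the GPU image is published under the same tag as the CPU image, which the comment above does not state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the Docker image name for CPU or GPU runs.
# Assumption: the _gpu image uses the same version tag as the CPU image.
deepvariant_image() {
  local version="$1" use_gpu="${2:-0}"
  local repo="gcr.io/deepvariant-docker/deepvariant"
  if [[ "$use_gpu" == "1" ]]; then
    repo="${repo}_gpu"
  fi
  printf '%s:%s\n' "$repo" "$version"
}

BIN_VERSION="0.7.2"   # placeholder; use the version you actually run
deepvariant_image "$BIN_VERSION" 1
```

The printed name can then be passed to sudo docker pull and used in place of the CPU image in the call_variants command.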

Note that even though using GPUs is faster, the overall cost might not be better depending on many other factors. Again, you can look at the GCP Cloud runner as an example of how they configure their run. If you end up doing more experiments to compare different configurations in your workflow, we would love to learn more about it as well. In addition to the GCP Cloud runner that @nmousavi 's team maintains, we also have seen other examples such as https://github.com/atgenomix/deepvariant-on-spark (and their WGS case study reports run time as well).

In terms of how much detail we include on the DeepVariant GitHub page – even though I’m personally very interested in the performance and cost of these implementations, I also need to consider the trade-off in the amount of detail we include, because too much information can end up being confusing. If you have more suggestions on how to organize the documentation better in the future, please let me know. Even now it’s already a bit messy and I would like to simplify it further. Thanks!

ekofman commented, Feb 7, 2019

@pichuan Thanks for such a thorough description, I really appreciate it. It is super helpful for understanding the different expectations and possible failure modes of each step. I will go back and make sure all of the preconditions you mentioned are met. One last question though – you didn’t run call_variants on a GPU, although that seems to be recommended; does a GPU just make it faster, or is there another reason to use one?