Missing CB or UB with umi-tools count

Hi, thanks for the useful software!

I’m trying to recount a BAM file generated by 10x, but I’m getting errors because not every read in my BAM file has the 10x-corrected cell-barcode or UMI tag (CB and UB, respectively).

FILE="neurons_900_possorted_genome_bam"
umi_tools count \
--extract-umi-method="tag" \
--per-gene \
--per-cell \
--umi-tag=UB \
--cell-tag=CB \
--mapping-quality=255 \
--gene-tag=GN \
--skip-tags-regex=";" \
--method="unique" \
--per-cell -I $FILE.bam -S counts.tsv.gz
# output generated by count --extract-umi-method=tag --per-gene --per-cell --umi-tag=UB --cell-tag=CB --mapping-quality=255 --gene-tag=GN --skip-tags-regex=; --method=unique --per-cell -I neurons_900_possorted_genome_bam.bam -S counts.tsv.gz
# job started at Wed Sep 19 20:44:31 2018 on compute-a-16-165.o2.rc.hms.harvard.edu -- eb361b58-51f6-4ff6-aa62-699d3d743044
# pid: 171613, system: Linux 3.10.0-327.3.1.el7.x86_64 #1 SMP Wed Dec 9 14:09:15 UTC 2015 x86_64
# cell_tag                                : CB
# cell_tag_delim                          : None
# cell_tag_split                          : -
# chrom                                   : None
# compresslevel                           : 6
# gene_tag                                : GN
# gene_transcript_map                     : None
# get_umi_method                          : tag
# ignore_umi                              : False
# in_sam                                  : False
# log2stderr                              : False
# loglevel                                : 1
# mapping_quality                         : 255
# method                                  : unique
# paired                                  : False
# per_cell                                : True
# per_contig                              : False
# per_gene                                : True
# random_seed                             : None
# short_help                              : None
# skip_regex                              : ;
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin                                   : <_io.TextIOWrapper name='neurons_900_possorted_genome_bam.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='counts.tsv.gz' encoding='ascii'>
# subset                                  : None
# threshold                               : 1
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# umi_sep                                 : _
# umi_tag                                 : UB
# umi_tag_delim                           : None
# umi_tag_split                           : None
# wide_format_cell_counts                 : False
Traceback (most recent call last):
  File "/home/dhb13/anaconda2/envs/umi/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/umi_tools.py", line 59, in main
    module.main(sys.argv)
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/count.py", line 110, in main
    for bundle, key, status in bundle_iterator(inreads):
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/umi_methods.py", line 1489, in __call__
    umi, cell = self.barcode_getter(read)
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/umi_methods.py", line 480, in get_barcode_tag
    cell = read.get_tag(cell_tag)
  File "pysam/libcalignedsegment.pyx", line 2383, in pysam.libcalignedsegment.AlignedSegment.get_tag
  File "pysam/libcalignedsegment.pyx", line 2425, in pysam.libcalignedsegment.AlignedSegment.get_tag
KeyError: "tag 'CB' not present"

Do you think the right solution is to filter the BAM file manually (via samtools and grep, for instance), or is it okay to catch the KeyError raised in umi_methods.py and skip the read, like so:

if self.options.ignore_umi:
    if self.options.per_cell:
        umi, cell = self.barcode_getter(read)
        umi = ""
    else:
        umi, cell = "", ""
else:
    try:
        umi, cell = self.barcode_getter(read)
    except KeyError:
        self.read_events['Read skipped, no tag'] += 1
        continue
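For the other option mentioned above, filtering the BAM before counting, a minimal sketch in Python could look like the following. It is an illustration under assumptions, not part of umi_tools: the names `keep_tagged_reads` and `filter_bam` are hypothetical, and `filter_bam` assumes pysam is installed and that the input is an ordinary BAM file.

```python
from collections import Counter


def keep_tagged_reads(reads, required_tags=("CB", "UB"), stats=None):
    """Yield only the reads that carry every tag in required_tags.

    `reads` is any iterable of objects exposing has_tag(), such as
    pysam.AlignedSegment. Skipped reads are tallied in `stats` if given.
    """
    for read in reads:
        if all(read.has_tag(tag) for tag in required_tags):
            yield read
        elif stats is not None:
            stats["Read skipped, no tag"] += 1


def filter_bam(in_path, out_path, required_tags=("CB", "UB")):
    """Hypothetical helper: write reads that have all required tags to out_path."""
    import pysam  # imported here so keep_tagged_reads has no dependencies

    stats = Counter()
    with pysam.AlignmentFile(in_path, "rb") as src, \
            pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        for read in keep_tagged_reads(src, required_tags, stats):
            dst.write(read)
    return stats
```

Reads are written in their original order, so a coordinate-sorted input stays sorted; the output would still need a fresh index (for example with samtools index) before being passed to umi_tools count.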

Secondly, if I also wanted umi_tools count to exclude multimapped reads, as is possible in dedup with --multimapping-detection-method, what do you suggest is the best way to do so? Is it just simpler to run the whole umi_tools pipeline from the 10x fastq files rather than trying to use the 10x BAM files?

Thanks in advance, David

Issue Analytics

Created 5 years ago
Comments:6 (5 by maintainers)

Top GitHub Comments

davidhbrann commented, Sep 26, 2018

Hi,

Thanks for the fix.

Can you elaborate on what you mean about not using 10X BAM files as input for umi_tools? For consistency, I want the same barcode correction procedure as if I had run the entire 10x pipeline, but I want to recount the BAM file after removing certain reads. In this case I would only use the umi_tools count command.

In other instances, I plan to run the entire umi_tools pipeline, as in the examples.

TomSmithCGAT commented, Sep 26, 2018

Hi David. I’d be happy to catch the KeyError and log as you suggest. What do you think @IanSudbery?

WRT multimapping reads, our recommendation is to remove these from the BAM beforehand. However, I would caution against using 10X BAM files as input for umi_tools, since the UMIs will already have been corrected by cellranger, which invalidates the assumptions behind the umi_tools deduplication. So, yes, it is much better to start from the 10X fastqs.
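As a rough illustration of removing multimapped reads beforehand, the sketch below keeps only reads whose NH tag (the number of reported alignments, as defined in the SAM optional-fields specification) equals 1. It assumes the aligner writes NH; reads without it are dropped. The function names are hypothetical, and the read objects only need the `is_unmapped`, `has_tag` and `get_tag` members that pysam's AlignedSegment provides.

```python
def is_uniquely_mapped(read):
    """Return True when the read is mapped and reports exactly one alignment."""
    if read.is_unmapped:
        return False
    # Assumption: the aligner sets NH; reads lacking the tag are treated
    # as not uniquely mapped and are therefore filtered out.
    return read.has_tag("NH") and read.get_tag("NH") == 1


def drop_multimapped(reads):
    """Yield only the uniquely mapped reads from an iterable of reads."""
    for read in reads:
        if is_uniquely_mapped(read):
            yield read
```

STAR by default assigns MAPQ 255 to uniquely mapped reads, so the --mapping-quality=255 option already in the command above may serve a similar purpose; whether that holds for a given cellranger BAM is worth checking rather than assuming.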