Missing CB or UB with umi-tools count

Hi, thanks for the useful software!

I’m trying to recount a BAM file generated by 10x, but I’m getting errors because not every read in my BAM file has the 10x-corrected cell-barcode or UMI tag (CB and UB, respectively).

FILE="neurons_900_possorted_genome_bam"
umi_tools count \
--extract-umi-method="tag" \
--per-gene \
--per-cell \
--umi-tag=UB \
--cell-tag=CB \
--mapping-quality=255 \
--gene-tag=GN \
--skip-tags-regex=";" \
--method="unique" \
--per-cell -I $FILE.bam -S counts.tsv.gz
# output generated by count --extract-umi-method=tag --per-gene --per-cell --umi-tag=UB --cell-tag=CB --mapping-quality=255 --gene-tag=GN --skip-tags-regex=; --method=unique --per-cell -I neurons_900_possorted_genome_bam.bam -S counts.tsv.gz
# job started at Wed Sep 19 20:44:31 2018 on compute-a-16-165.o2.rc.hms.harvard.edu -- eb361b58-51f6-4ff6-aa62-699d3d743044
# pid: 171613, system: Linux 3.10.0-327.3.1.el7.x86_64 #1 SMP Wed Dec 9 14:09:15 UTC 2015 x86_64
# cell_tag                                : CB
# cell_tag_delim                          : None
# cell_tag_split                          : -
# chrom                                   : None
# compresslevel                           : 6
# gene_tag                                : GN
# gene_transcript_map                     : None
# get_umi_method                          : tag
# ignore_umi                              : False
# in_sam                                  : False
# log2stderr                              : False
# loglevel                                : 1
# mapping_quality                         : 255
# method                                  : unique
# paired                                  : False
# per_cell                                : True
# per_contig                              : False
# per_gene                                : True
# random_seed                             : None
# short_help                              : None
# skip_regex                              : ;
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin                                   : <_io.TextIOWrapper name='neurons_900_possorted_genome_bam.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='counts.tsv.gz' encoding='ascii'>
# subset                                  : None
# threshold                               : 1
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# umi_sep                                 : _
# umi_tag                                 : UB
# umi_tag_delim                           : None
# umi_tag_split                           : None
# wide_format_cell_counts                 : False
Traceback (most recent call last):
  File "/home/dhb13/anaconda2/envs/umi/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/umi_tools.py", line 59, in main
    module.main(sys.argv)
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/count.py", line 110, in main
    for bundle, key, status in bundle_iterator(inreads):
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/umi_methods.py", line 1489, in __call__
    umi, cell = self.barcode_getter(read)
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/umi_methods.py", line 480, in get_barcode_tag
    cell = read.get_tag(cell_tag)
  File "pysam/libcalignedsegment.pyx", line 2383, in pysam.libcalignedsegment.AlignedSegment.get_tag
  File "pysam/libcalignedsegment.pyx", line 2425, in pysam.libcalignedsegment.AlignedSegment.get_tag
KeyError: "tag 'CB' not present"

Do you think the right solution is to filter the BAM file manually (via samtools and grep, for instance), or is it okay to catch the KeyError raised in umi_methods.py, like so:

if self.options.ignore_umi:
    if self.options.per_cell:
        umi, cell = self.barcode_getter(read)
        umi = ""
    else:
        umi, cell = "", ""
else:
    try:
        umi, cell = self.barcode_getter(read)
    except KeyError:
        self.read_events['Read skipped, no tag'] += 1
        continue
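
For comparison, the manual-filtering alternative could be done in Python instead of samtools and grep. This is only a minimal sketch: `keep_tagged_reads` and `FakeRead` are hypothetical names made up for illustration, and the only pysam read method assumed is `has_tag`.

```python
from collections import Counter


def keep_tagged_reads(reads, required_tags=("CB", "UB"), stats=None):
    """Yield only the reads that carry every tag in required_tags."""
    if stats is None:
        stats = Counter()
    for read in reads:
        if all(read.has_tag(tag) for tag in required_tags):
            stats["kept"] += 1
            yield read
        else:
            # Same idea as counting 'Read skipped, no tag' above.
            stats["skipped, missing tag"] += 1


class FakeRead:
    """Hypothetical stand-in for pysam.AlignedSegment, for demonstration only."""

    def __init__(self, **tags):
        self.tags = tags

    def has_tag(self, tag):
        return tag in self.tags


# With pysam installed, the same function could be applied to a real file
# (the file names below are placeholders):
#
#   import pysam
#   with pysam.AlignmentFile("in.bam", "rb") as src, \
#        pysam.AlignmentFile("tagged.bam", "wb", template=src) as dst:
#       for read in keep_tagged_reads(src):
#           dst.write(read)
```

Filtering up front leaves umi_tools itself untouched, at the cost of writing a second BAM file.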

Secondly, if I also wanted umi_tools count to exclude multimapped reads, as is possible in dedup with --multimapping-detection-method, what would you suggest is the best way to do so? Would it be simpler to run the whole umi-tools pipeline from the 10x fastq files rather than trying to use the 10x BAM files?

Thanks in advance, David

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
davidhbrann commented, Sep 26, 2018

Hi,

Thanks for the fix.

Can you elaborate on what you mean about not using 10X BAM files as input for umi_tools? For consistency, I want the same barcode correction procedure as if I had run the entire 10x pipeline, but then recount the BAM file after removing certain reads. In this case I would only use the umi_tools count command.

In other instances, I plan to run the entire umi_tools pipeline, as in the examples.

1 reaction
TomSmithCGAT commented, Sep 26, 2018

Hi David. I’d be happy to catch the KeyError and log as you suggest. What do you think @IanSudbery?

WRT multimapping reads, our recommendation is to remove these from the BAM beforehand. However, I would caution against using 10X BAM files as input for umi_tools, since the UMIs will already have been corrected by cellranger, which invalidates the assumptions behind the umi_tools deduplication. So, yes, it is much better to start from the 10X fastqs.
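
As a rough illustration of removing multimapped reads beforehand, the check below treats a read as unique when its NH tag equals 1, and otherwise falls back on MAPQ 255 (the value STAR gives unique mappers). It is a sketch under those assumptions, not part of umi_tools: `is_uniquely_mapped` and `FakeAlignment` are hypothetical names, and whether NH is present depends on the aligner.

```python
def is_uniquely_mapped(read, unique_mapq=255):
    """Return True if the read looks uniquely mapped.

    Assumptions: when an NH tag is present, NH == 1 means one reported
    alignment; without NH, MAPQ == unique_mapq marks a unique mapper.
    """
    if read.is_unmapped:
        return False
    if read.has_tag("NH"):
        return read.get_tag("NH") == 1
    return read.mapping_quality == unique_mapq


class FakeAlignment:
    """Hypothetical stand-in for pysam.AlignedSegment, for demonstration only."""

    def __init__(self, mapping_quality, is_unmapped=False, **tags):
        self.mapping_quality = mapping_quality
        self.is_unmapped = is_unmapped
        self.tags = tags

    def has_tag(self, tag):
        return tag in self.tags

    def get_tag(self, tag):
        # Mirrors the KeyError that pysam raises for a missing tag.
        return self.tags[tag]


examples = [
    FakeAlignment(255, NH=1),  # unique according to NH
    FakeAlignment(3, NH=2),    # multimapped according to NH
    FakeAlignment(255),        # no NH tag, falls back on MAPQ
]
unique = [read for read in examples if is_uniquely_mapped(read)]
```

The same predicate could be applied to reads from a pysam.AlignmentFile before writing a filtered BAM for umi_tools count.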
