Missing CB or UB with umi-tools count
Hi, thanks for the useful software!
I’m trying to recount a BAM file generated via 10x, but I’m getting errors because not every read in my BAM file carries the 10x-corrected cell barcode or UMI tag (CB and UB, respectively).
FILE="neurons_900_possorted_genome_bam"
umi_tools count \
--extract-umi-method="tag" \
--per-gene \
--per-cell \
--umi-tag=UB \
--cell-tag=CB \
--mapping-quality=255 \
--gene-tag=GN \
--skip-tags-regex=";" \
--method="unique" \
--per-cell -I $FILE.bam -S counts.tsv.gz
# output generated by count --extract-umi-method=tag --per-gene --per-cell --umi-tag=UB --cell-tag=CB --mapping-quality=255 --gene-tag=GN --skip-tags-regex=; --method=unique --per-cell -I neurons_900_possorted_genome_bam.bam -S counts.tsv.gz
# job started at Wed Sep 19 20:44:31 2018 on compute-a-16-165.o2.rc.hms.harvard.edu -- eb361b58-51f6-4ff6-aa62-699d3d743044
# pid: 171613, system: Linux 3.10.0-327.3.1.el7.x86_64 #1 SMP Wed Dec 9 14:09:15 UTC 2015 x86_64
# cell_tag : CB
# cell_tag_delim : None
# cell_tag_split : -
# chrom : None
# compresslevel : 6
# gene_tag : GN
# gene_transcript_map : None
# get_umi_method : tag
# ignore_umi : False
# in_sam : False
# log2stderr : False
# loglevel : 1
# mapping_quality : 255
# method : unique
# paired : False
# per_cell : True
# per_contig : False
# per_gene : True
# random_seed : None
# short_help : None
# skip_regex : ;
# stderr : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin : <_io.TextIOWrapper name='neurons_900_possorted_genome_bam.bam' mode='r' encoding='UTF-8'>
# stdlog : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# stdout : <_io.TextIOWrapper name='counts.tsv.gz' encoding='ascii'>
# subset : None
# threshold : 1
# timeit_file : None
# timeit_header : None
# timeit_name : all
# umi_sep : _
# umi_tag : UB
# umi_tag_delim : None
# umi_tag_split : None
# wide_format_cell_counts : False
Traceback (most recent call last):
  File "/home/dhb13/anaconda2/envs/umi/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/umi_tools.py", line 59, in main
    module.main(sys.argv)
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/count.py", line 110, in main
    for bundle, key, status in bundle_iterator(inreads):
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/umi_methods.py", line 1489, in __call__
    umi, cell = self.barcode_getter(read)
  File "/home/dhb13/anaconda2/envs/umi/lib/python3.6/site-packages/umi_tools/umi_methods.py", line 480, in get_barcode_tag
    cell = read.get_tag(cell_tag)
  File "pysam/libcalignedsegment.pyx", line 2383, in pysam.libcalignedsegment.AlignedSegment.get_tag
  File "pysam/libcalignedsegment.pyx", line 2425, in pysam.libcalignedsegment.AlignedSegment.get_tag
KeyError: "tag 'CB' not present"
Do you think the right solution is to filter the BAM file manually (via samtools and grep, for instance), or is it okay to catch the KeyError raised in umi_methods.py, like so:
if self.options.ignore_umi:
    if self.options.per_cell:
        umi, cell = self.barcode_getter(read)
        umi = ""
    else:
        umi, cell = "", ""
else:
    try:
        umi, cell = self.barcode_getter(read)
    except KeyError:
        self.read_events['Read skipped, no tag'] += 1
        continue
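For comparison, the manual pre-filtering route mentioned above could be sketched in plain Python instead of grep. This is only an illustration: it works on SAM text, relies on the SAM convention that optional fields start at column 12 and have the form TAG:TYPE:VALUE, and the function names and the sample records are made up for the example, not taken from umi_tools or from a real 10x file.

```python
def optional_tags(sam_line):
    """Return {TAG: value} for the optional fields of one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def keep_tagged(lines, required=("CB", "UB")):
    """Yield header lines and alignments that carry every required tag."""
    for line in lines:
        if line.startswith("@"):  # header lines pass through untouched
            yield line
        elif all(tag in optional_tags(line) for tag in required):
            yield line


# Made-up records: r1 has corrected barcodes (CB/UB), r2 only raw ones (CR/UR).
sample = [
    "@HD\tVN:1.4\tSO:coordinate\n",
    "r1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF"
    "\tNH:i:1\tCB:Z:AAACCTGA-1\tUB:Z:ACGTACGTAC\n",
    "r2\t0\tchr1\t200\t255\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF"
    "\tNH:i:1\tCR:Z:AAACCTGT\tUR:Z:ACGTACGTAA\n",
]
kept = list(keep_tagged(sample))
print(len(kept), "of", len(sample), "lines kept")
```

Wired to standard input, for example with sys.stdout.writelines(keep_tagged(sys.stdin)), the generator could sit between samtools view -h and samtools view -b. Unlike the try/except patch, this leaves no count of skipped reads in the umi_tools log.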
Secondly, if I also wanted to exclude multimapped reads in umi_tools count, as is possible in dedup with --multimapping-detection-method, what do you suggest is the best way to do so? Is it just simpler to run the whole umi-tools pipeline from the 10x fastq files rather than trying to use the 10x BAM files?
Thanks in advance, David
Top GitHub Comments
Hi,
Thanks for the fix.
Can you elaborate on what you mean about not using 10X BAM files as input for umi_tools? For consistency, I want the same barcode correction procedure as if I had run the entire 10x pipeline, but I want to recount the BAM file after removing certain reads. In this case I would only use the umi_tools count command. In other instances, I plan to run the entire umi_tools pipeline, as in the examples.

Hi David. I’d be happy to catch the KeyError and log it as you suggest. What do you think @IanSudbery?
WRT multimapping reads, our recommendation is to remove these from the BAM beforehand. However, I would caution against using 10X BAM files as input for umi_tools, since the UMIs will already have been corrected by cellranger, which invalidates the assumptions behind the umi_tools deduplication. So, yes, much better to start from the 10X fastqs.
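
One way to remove multimapped reads from the BAM beforehand, sketched under the assumption that the aligner wrote the standard NH tag (number of reported alignments for the read) on every record. The function names and the sample records below are hypothetical, and whether reads lacking NH should be kept or dropped is a policy choice; this sketch drops them.

```python
def nh_value(sam_line):
    """Return the integer NH tag of a SAM alignment line, or None if absent."""
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("NH:i:"):
            return int(field[len("NH:i:"):])
    return None


def drop_multimapped(lines):
    """Yield header lines and alignments reported exactly once (NH == 1).

    Alignments without an NH tag are dropped as well, since None != 1.
    """
    for line in lines:
        if line.startswith("@") or nh_value(line) == 1:
            yield line


# Made-up records: u1 maps uniquely, m1 is reported at three locations.
records = [
    "u1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tNH:i:1\n",
    "m1\t0\tchr2\t500\t3\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tNH:i:3\n",
]
unique = list(drop_multimapped(records))
print([line.split("\t")[0] for line in unique])
```

As with any text-level filter, the input would come from samtools view -h and the output would be converted back to BAM and re-indexed before counting.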