Question about group / dedup --per-gene
When I run the following command (or a similar dedup command):
umi_tools group -I test.bam --per-gene --gene-transcript-map='gene2rapmap.tsv' --edit-distance-threshold=2 --group-out=test.bam.grouped.tsv --output-bam -S test.group.2.bam -L test.group.2.log --
The dedup command looks like this:
umi_tools dedup -I test.bam --per-gene --gene-transcript-map='gene2rapmap.tsv' --edit-distance-threshold=2 -S test.dedup.2.bam -L test.dedup.2.log --
where gene2rapmap.tsv looks like this:
ENSG00000223972.5 ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript|
ENSG00000223972.5 ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|
ENSG00000227232.5 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene|
ENSG00000278267.1 ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|
ENSG00000243485.5 ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-001|MIR1302-2HG|712|lincRNA|
ENSG00000243485.5 ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-2HG-002|MIR1302-2HG|535|lincRNA|
ENSG00000284332.1 ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|
ENSG00000237613.2 ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-001|FAM138A|1187|lincRNA|
ENSG00000237613.2 ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-002|FAM138A|590|lincRNA|
ENSG00000268020.3 ENST00000606857.1|ENSG00000268020.3|OTTHUMG00000185779.1|OTTHUMT00000471235.1|OR4G4P-001|OR4G4P|840|unprocessed_pseudogene|
and the input bam (test.bam) looks like this:
NS500624:117:HMK7JBGX2:1:21312:21351:15400:CELL_ACGCCGACATTAACCG:UMI_GTCTTATGCA:SAMPLE_TAACAAGG:UID_TAACAAGGACGCCGACATTAACCGGTCTTATGCA 16 ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript| 163 255 98M * 0 0 CTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTCCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGC * NH:i:4
NS500624:117:HMK7JBGX2:4:12612:17207:3409:CELL_AGTAGTCCACCGATAT:UMI_GCTACTGAGT:SAMPLE_TAACAAGG:UID_TAACAAGGAGTAGTCCACCGATATGCTACTGAGT 16 ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript| 185 255 98M * 0 0 CTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTT * NH:i:4
NS500624:117:HMK7JBGX2:4:22604:24099:15539:CELL_ATCCGAAAGTGTCTCA:UMI_CGCAGGACAT:SAMPLE_TAACAAGG:UID_TAACAAGGATCCGAAAGTGTCTCACGCAGGACAT 16 ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-002|DDX11L1|1657|processed_transcript| 242 255 98M * 0 0 CCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAAC * NH:i:3
NS500624:117:HMK7JBGX2:2:11102:6369:4141:CELL_CCTCACGAGGGTCTCC:UMI_GCATTATAAG:SAMPLE_TAACAAGG:UID_TAACAAGGCCTCACGAGGGTCTCCGCATTATAAG 0 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene| 77 255 98M * 0 0 GTCAGACCTATGCCGTGCCCTTCATCCAGCCAGACCTGCGGCGAGAGGAGGCCGTCCAGCAGCTGGCGGATGCCCTGCAGTACCTGCAGAAGGTCTCT * NH:i:8
NS500624:117:HMK7JBGX2:3:22602:10740:6438:CELL_CATCAGATCATAAAGG:UMI_TCTTCTCAAC:SAMPLE_TAACAAGG:UID_TAACAAGGCATCAGATCATAAAGGTCTTCTCAAC 0 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene| 104 255 98M * 0 0 AGCCAGACCTGCGGCGAGAGGAGGCCGTCCAGCAGATGGTGGATGCCCTGCAGTACCTGCAGAAGGTCTCTGGAGCCATCTTCAGCAGCCAACAAATA * NH:i:17
NS500624:117:HMK7JBGX2:2:13207:23509:9911:CELL_TGGTTAGTCACTTACT:UMI_GACTAACAGG:SAMPLE_TAACAAGG:UID_TAACAAGGTGGTTAGTCACTTACTGACTAACAGG 0 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene| 175 255 98M * 0 0 GCAGGATCTCCCAGCGGGTAGAGCAGAGCCGGAGCCAGGTGCAGGCCATTGGAGAGAAGGTCTCCTTGGCCCAGGCCAAGATTGAGAAGATCAAGGGC * NH:i:31
NS500624:117:HMK7JBGX2:1:13104:5438:5434:CELL_CCTTCCCTGGGTCCCC:UMI_ACTAAGCCAG:SAMPLE_TAACAAGG:UID_TAACAAGGCCTTCCCTGGGTCCCCACTAAGCCAG 0 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene| 301 255 98M * 0 0 GTGCCAAGTACCCTGCTCCAGGGCGCCTGCAGGAATATGGCTCCATCTTCACGGGCGCCCAGGACCCTGGCCTGCAGAGACGCCCCCGCCACAGGGTC * NH:i:2
NS500624:117:HMK7JBGX2:3:11401:2985:6544:CELL_CCTTCCCTGGGTCCCC:UMI_ACTAAGCCAG:SAMPLE_TAACAAGG:UID_TAACAAGGCCTTCCCTGGGTCCCCACTAAGCCAG 0 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene| 396 255 98M * 0 0 ATCCAGAGCAAGCACCGCCCCCTGGACGAGCGGGCCCTGCAGGAGAAGCTGAAGGACTTTCCTGTGTGCGTGAGCACCAAGCCGGAGCCCGAGGACGA * NH:i:19
NS500624:117:HMK7JBGX2:3:23405:16647:18833:CELL_GTCACAGCATCACATG:UMI_GGTCTTTGTT:SAMPLE_TAACAAGG:UID_TAACAAGGGTCACAGCATCACATGGGTCTTTGTT 0 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene| 429 255 98M * 0 0 GCCCTGCAGGAGAAGCTGAAGGACTTTCCTGTGTGCGTGAGCACCAAGCCGGAGCCCGAGGACGATGCAGAAGAGGGACTTGGGGGTCTTCCCAGCAA * NH:i:19
NS500624:117:HMK7JBGX2:3:13601:6317:1535:CELL_GGCTCGAAGCTCCTTC:UMI_GTCACACAGA:SAMPLE_TAACAAGG:UID_TAACAAGGGGCTCGAAGCTCCTTCGTCACACAGA 16 ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-001|WASH7P|1351|unprocessed_pseudogene| 552 255 98M * 0 0 GGACCCCCATGTCGCCCCTGTAGGTACAAGAAGGATGTCTTCCTGGACCCCCTGGCTGGTGCTGTACCAAAGACCCATGTACTCTGCTTTGATTACAC * NH:i:27
and the reads are from 10x Chromium scRNA-seq, formatted with vals/umis and quasi-mapped with RapMap, I get results that I don’t quite understand. My expectation is that reads with the same gene/UID pairing will be either deduplicated or grouped; however, I get results like this:
for dedup:
NS500624:117:HMK7JBGX2:1:13104:7177:19118:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000504434.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372219.1|RNF44-004|RNF44|785|retained_intron| 592 255 98M * 0 0 GCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGGTCATGGGGGATCTCCACGAGCAGGTGCGCCAGGGACCTGTCCCTCTGTC * NH:i:4 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:4:13406:21644:1183:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000274811.8|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000253156.2|RNF44-001|RNF44|4155|protein_coding| 728 255 98M * 0 0 AGGAGCGCCGAGCCTCGGCTCCTGCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGTTCATGGTTGATCTCCACGAGCAGGTG * NH:i:3 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:1:13104:7177:19118:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000274811.8|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000253156.2|RNF44-001|RNF44|4155|protein_coding| 751 255 98M * 0 0 GCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGGTCATGGGGGATCTCCACGAGCAGGTGCGCCAGGGACCTGTCCCTCTGTC * NH:i:4 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:2:23210:22337:12217:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000274811.8|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000253156.2|RNF44-001|RNF44|4155|protein_coding| 831 255 98M * 0 0 GGGACCTGTCCCTCTGTCCTACACGGTCACCACAGTGACGACCCAAGGCTTCCCCTTGCCTACAGGCCAGCACATCCCTGGAAACCACGGCAACCTGT * NH:i:3 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:4:13406:21644:1183:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000513029.5|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372217.1|RNF44-002|RNF44|1647|nonsense_mediated_decay| 481 255 98M * 0 0 AGGAGCGCCGAGCCTCGGCTCCTGCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGTTCATGGTTGATCTCCACGAGCAGGTG * NH:i:3 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:1:13104:7177:19118:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000513029.5|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372217.1|RNF44-002|RNF44|1647|nonsense_mediated_decay| 504 255 98M * 0 0 GCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGGTCATGGGGGATCTCCACGAGCAGGTGCGCCAGGGACCTGTCCCTCTGTC * NH:i:4 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:2:23210:22337:12217:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000513029.5|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372217.1|RNF44-002|RNF44|1647|nonsense_mediated_decay| 584 255 98M * 0 0 GGGACCTGTCCCTCTGTCCTACACGGTCACCACAGTGACGACCCAAGGCTTCCCCTTGCCTACAGGCCAGCACATCCCTGGAAACCACGGCAACCTGT * NH:i:3 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:2:11310:21286:17263:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 426 255 98M * 0 0 ACTAGGTGGCCACCCTCCGCCCCCGTGGGCCAGCGGCGATTCTCTGCGGGACCTGGCAGCACCCCGGGCCAGCTCTGGGGAAGCCGCCGTCCCGACCT * NH:i:1 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:3:23506:6775:14776:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 477 255 98M * 0 0 CCTGGCCGCACCCCGGGCCAGCTCTGGGGAAGCCGCCGTCCCGACCTCCCCACCTCCCCGTAGAGGAGCGCCGAGCCTCGGCTCCGGCCGGCGGGAGC * NH:i:1 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:4:13406:21644:1183:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 540 255 98M * 0 0 AGGAGCGCCGAGCCTCGGCTCCTGCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGTTCATGGTTGATCTCCACGAGCAGGTG * NH:i:3 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:1:13104:7177:19118:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 563 255 98M * 0 0 GCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGGTCATGGGGGATCTCCACGAGCAGGTGCGCCAGGGACCTGTCCCTCTGTC * NH:i:4 MC:Z:ENSG00000146083.11
NS500624:117:HMK7JBGX2:2:23210:22337:12217:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 643 255 98M * 0 0 GGGACCTGTCCCTCTGTCCTACACGGTCACCACAGTGACGACCCAAGGCTTCCCCTTGCCTACAGGCCAGCACATCCCTGGAAACCACGGCAACCTGT * NH:i:3 MC:Z:ENSG00000146083.11
for group:
NS500624:117:HMK7JBGX2:1:13104:7177:19118:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000504434.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372219.1|RNF44-004|RNF44|785|retained_intron| 592 255 98M * 0 0 GCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGGTCATGGGGGATCTCCACGAGCAGGTGCGCCAGGGACCTGTCCCTCTGTC * NH:i:4 MC:Z:ENSG00000146083.11 UG:i:8031 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:4:13406:21644:1183:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000274811.8|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000253156.2|RNF44-001|RNF44|4155|protein_coding| 728 255 98M * 0 0 AGGAGCGCCGAGCCTCGGCTCCTGCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGTTCATGGTTGATCTCCACGAGCAGGTG * NH:i:3 MC:Z:ENSG00000146083.11 UG:i:8046 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:1:13104:7177:19118:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000274811.8|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000253156.2|RNF44-001|RNF44|4155|protein_coding| 751 255 98M * 0 0 GCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGGTCATGGGGGATCTCCACGAGCAGGTGCGCCAGGGACCTGTCCCTCTGTC * NH:i:4 MC:Z:ENSG00000146083.11 UG:i:8047 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:2:23210:22337:12217:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000274811.8|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000253156.2|RNF44-001|RNF44|4155|protein_coding| 831 255 98M * 0 0 GGGACCTGTCCCTCTGTCCTACACGGTCACCACAGTGACGACCCAAGGCTTCCCCTTGCCTACAGGCCAGCACATCCCTGGAAACCACGGCAACCTGT * NH:i:3 MC:Z:ENSG00000146083.11 UG:i:8048 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:4:13406:21644:1183:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000513029.5|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372217.1|RNF44-002|RNF44|1647|nonsense_mediated_decay| 481 255 98M * 0 0 AGGAGCGCCGAGCCTCGGCTCCTGCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGTTCATGGTTGATCTCCACGAGCAGGTG * NH:i:3 MC:Z:ENSG00000146083.11 UG:i:8118 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:1:13104:7177:19118:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000513029.5|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372217.1|RNF44-002|RNF44|1647|nonsense_mediated_decay| 504 255 98M * 0 0 GCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGGTCATGGGGGATCTCCACGAGCAGGTGCGCCAGGGACCTGTCCCTCTGTC * NH:i:4 MC:Z:ENSG00000146083.11 UG:i:8119 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:2:23210:22337:12217:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000513029.5|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372217.1|RNF44-002|RNF44|1647|nonsense_mediated_decay| 584 255 98M * 0 0 GGGACCTGTCCCTCTGTCCTACACGGTCACCACAGTGACGACCCAAGGCTTCCCCTTGCCTACAGGCCAGCACATCCCTGGAAACCACGGCAACCTGT * NH:i:3 MC:Z:ENSG00000146083.11 UG:i:8120 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:2:11310:21286:17263:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 426 255 98M * 0 0 ACTAGGTGGCCACCCTCCGCCCCCGTGGGCCAGCGGCGATTCTCTGCGGGACCTGGCAGCACCCCGGGCCAGCTCTGGGGAAGCCGCCGTCCCGACCT * NH:i:1 MC:Z:ENSG00000146083.11 UG:i:8130 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:3:23506:6775:14776:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 0 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 477 255 98M * 0 0 CCTGGCCGCACCCCGGGCCAGCTCTGGGGAAGCCGCCGTCCCGACCTCCCCACCTCCCCGTAGAGGAGCGCCGAGCCTCGGCTCCGGCCGGCGGGAGC * NH:i:1 MC:Z:ENSG00000146083.11 UG:i:8134 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:4:13406:21644:1183:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 540 255 98M * 0 0 AGGAGCGCCGAGCCTCGGCTCCTGCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGTTCATGGTTGATCTCCACGAGCAGGTG * NH:i:3 MC:Z:ENSG00000146083.11 UG:i:8137 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:1:13104:7177:19118:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 563 255 98M * 0 0 GCCGGCGGGAGCCCCCGAATGCTGCACCCAGCCACCCAGCAGAGCCCGGTCATGGGGGATCTCCACGAGCAGGTGCGCCAGGGACCTGTCCCTCTGTC * NH:i:4 MC:Z:ENSG00000146083.11 UG:i:8138 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
NS500624:117:HMK7JBGX2:2:23210:22337:12217:CELL_CTTCTCTGTCACACGC:UMI_TACGCCGCTC:SAMPLE_TAACAAGG:UID_TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC 2304 ENST00000504160.1|ENSG00000146083.11|OTTHUMG00000130664.3|OTTHUMT00000372218.1|RNF44-003|RNF44|887|nonsense_mediated_decay| 643 255 98M * 0 0 GGGACCTGTCCCTCTGTCCTACACGGTCACCACAGTGACGACCCAAGGCTTCCCCTTGCCTACAGGCCAGCACATCCCTGGAAACCACGGCAACCTGT * NH:i:3 MC:Z:ENSG00000146083.11 UG:i:8139 BX:Z:TAACAAGGCTTCTCTGTCACACGCTACGCCGCTC
In both cases the reads have the same gene/UID combination; however, with group, unique UG tags are applied, and with dedup, all reads are retained. Is this expected? Am I doing something wrong?
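To make the expectation concrete, here is a minimal sketch of the grouping I had in mind. It is not how umi_tools works internally; it ignores the edit-distance threshold and only collapses exact gene/UID matches. The function names and the short `GENE_A` / `TX_1` identifiers are hypothetical, chosen for illustration. It assumes the map has the gene in the first column and the transcript in the second, and that the UID is the suffix of the read name after `UID_`, as in the records above.

```python
from collections import defaultdict


def load_transcript_to_gene(lines):
    """Build a transcript -> gene lookup from 'gene<whitespace>transcript' rows."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        gene, transcript = fields[0], fields[1]
        mapping[transcript] = gene
    return mapping


def uid_from_read_name(read_name):
    """Return the sequence following the last 'UID_' in the read name."""
    return read_name.rsplit("UID_", 1)[1]


def group_by_gene_and_uid(reads, transcript_to_gene):
    """Group (read_name, transcript) pairs by their (gene, UID) key."""
    groups = defaultdict(list)
    for read_name, transcript in reads:
        key = (transcript_to_gene[transcript], uid_from_read_name(read_name))
        groups[key].append(read_name)
    return dict(groups)


# Hypothetical toy data: two transcripts of one gene, one transcript of another.
gene_map = load_transcript_to_gene([
    "GENE_A\tTX_1",
    "GENE_A\tTX_2",
    "GENE_B\tTX_3",
])
reads = [
    ("read1:UMI_AAAA:UID_CCAAAA", "TX_1"),
    ("read2:UMI_AAAA:UID_CCAAAA", "TX_2"),  # same gene and UID as read1
    ("read3:UMI_AAAA:UID_CCAAAA", "TX_3"),  # same UID, different gene
]
groups = group_by_gene_and_uid(reads, gene_map)
```

Under this expectation, `read1` and `read2` would share one group (one UG tag, or one retained read after dedup) even though they map to different transcripts, while `read3` would form its own group.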
Top GitHub Comments
We’ll be supporting per-cell counting from a single BAM very soon! We should have a working version on a branch later this week if you wanted to test it out?
I’ll have a look into that count output - not very useful having no gene names!
That seems a long run time for 70M reads. I’m guessing this is because you have some genes with many UMIs, which makes the network-building stage very time consuming. The time taken to build the networks is quadratic with respect to the number of UMIs and linear with respect to their length. I’d therefore expect the de-duplication/counting to be much quicker when it’s done at the cell level, since the number of UMIs per cell per gene will be far smaller.
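The scaling can be illustrated with a rough sketch. This is not the actual umi_tools code; `hamming` and `build_network` are hypothetical names, and the all-pairs comparison is simply an assumed, naive way of building such a network. Each comparison walks the UMI once (linear in length), and every pair of UMIs is compared (quadratic in their number).

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def build_network(umis, threshold=1):
    """Connect every pair of UMIs that differ at no more than `threshold` positions.

    Returns the adjacency dict and the number of pairwise comparisons made.
    """
    adjacency = {umi: set() for umi in umis}
    comparisons = 0
    for a, b in combinations(umis, 2):
        comparisons += 1
        if hamming(a, b) <= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency, comparisons


adjacency, comparisons = build_network(["AAAA", "AAAT", "TTTT"])
```

For n UMIs this makes n(n-1)/2 comparisons, so splitting one gene's UMIs across many cells cuts the work sharply: 1,000 UMIs in a single network need 499,500 comparisons, whereas 100 cells with 10 UMIs each need only 100 × 45 = 4,500.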
Hi @Simon-Coetzee. Could you try re-installing the branch? We didn’t have any tests covering the per-gene option with group, so this bug was missed previously. It should be fixed now.
Am I correct in thinking you are also ultimately using the per-gene option for “read counting”? I ask because, in the long term, we’re considering removing the --per-gene option from dedup and group. As far as we’re aware, the only purpose of per-gene deduplication is to count the number of reads per gene, so we have made a separate count command to achieve exactly this. Restricting the use of the different commands should help clarify the best command for the job. The count command is actually already available, but we won’t be properly making users aware of it until we release version 0.5.