Strange results on 10X hgmm10k_v3 dataset

Hi,

While CellBender works as expected on 10X hgmm12k (v2), on 10X hgmm10k (v3), it strangely removes large mouse gene counts and adds large human gene counts to mouse cells. 10X hgmm5k (v3) gives similar unexpected results as hgmm10k (v3). Please see logs and plots (hgmm12k and hgmm10k only) below:

hgmm12k, v2

  1. Log:
cellbender:remove-background: Command:
cellbender remove-background --input data/hgmm_12k/hgmm_12k_raw_gene_bc_matrices_h5.h5 --output data/cellbender/hgmm_12k_raw_gene_bc_matrices_h5.cellbender.h5 --expected-cells 12000 --total-droplets-included 22000 --epochs 150 --cuda
cellbender:remove-background: 2020-01-29 12:36:14
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file data/hgmm_12k/hgmm_12k_raw_gene_bc_matrices_h5.h5
cellbender:remove-background: CellRanger v2 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Prior on counts in empty droplets is 199
cellbender:remove-background: Prior on counts for cells is 13864
cellbender:remove-background: Excluding barcodes with counts below 159
cellbender:remove-background: Using 12000 probable cell barcodes, plus an additional 10000 barcodes, and 48062 empty droplets.
  2. Elbow plot, vertical line marks --expected-cells and --total-droplets-included: image

  3. Before correction (called cells): image

  4. After correction (called cells): image

  5. Convergence: image

hgmm10k, v3

  1. Log:
cellbender:remove-background: Command:
cellbender remove-background --input data/hgmm_10k/hgmm_10k_v3_raw_feature_bc_matrix.h5 --output data/cellbender/hgmm_10k_v3_raw_feature_bc_matrix.cellbender.h5 --expected-cells 10000 --total-droplets-included 20000 --epochs 150 --cuda
cellbender:remove-background: 2020-01-29 09:31:14
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file data/hgmm_10k/hgmm_10k_v3_raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Prior on counts in empty droplets is 444
cellbender:remove-background: Prior on counts for cells is 19036
cellbender:remove-background: Excluding barcodes with counts below 355
cellbender:remove-background: Using 10000 probable cell barcodes, plus an additional 10000 barcodes, and 56957 empty droplets.
  2. Elbow plot, vertical line marks --expected-cells and --total-droplets-included: image

  3. Before correction (called cells): image

  4. After correction (called cells): image

  5. Convergence: image

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

2 reactions
sjfleming commented, Feb 6, 2020

This run might not have totally converged, but this is the result of running

cellbender remove-background --input 10k_hgmm_v3_nextgem_raw_feature_bc_matrix.h5 --output 10k_hgmm_v3_nextgem_out.h5 --cuda --expected-cells 10000 --total-droplets-included 20000 --epochs 300 --z-dim 20 --z-layers 100

image

image

image

1 reaction
sjfleming commented, Feb 5, 2020

Found it. It was coming from the use of the datatype uint16 to store gene indices during the creation of the output sparse count matrix… I guess at some point way back, I thought, “There won’t be transcriptomes with more than 65k genes, right?” Not right.

I will push a fix for this soon.
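
To illustrate why a uint16 gene index could produce the symptom reported above, here is a minimal sketch in plain Python. It is not CellBender's code. The gene counts, the human-first feature ordering, and the helper names `store_as_uint16` and `min_index_bits` are all assumptions made for the example. A 16-bit unsigned slot keeps only the low 16 bits of a value, so any index above 65535 wraps around to a small index. If mouse genes sit at the end of the feature list, their counts would then land on human genes near the start.

```python
UINT16_MAX = 0xFFFF  # 65535, the largest value a 16-bit unsigned integer can hold


def store_as_uint16(index: int) -> int:
    """Emulate writing an index into a uint16 slot: high bits are silently dropped."""
    return index & UINT16_MAX


def min_index_bits(n_genes: int) -> int:
    """Smallest standard unsigned width (in bits) that can hold indices 0..n_genes-1."""
    for bits in (8, 16, 32, 64):
        if n_genes - 1 < 2 ** bits:
            return bits
    raise ValueError("too many genes for a 64-bit index")


# Hypothetical feature layout (an assumption, not taken from the issue):
# 70,000 genes in total, human genes first, mouse genes after them.
N_GENES = 70_000
N_HUMAN = 36_000

# One mouse cell with counts on three mouse genes: {gene_index: count}.
true_counts = {36_000: 50, 65_536: 120, 69_999: 80}

# Rebuild the counts after each index has passed through a uint16 slot.
stored_counts = {}
for gene, count in true_counts.items():
    wrapped = store_as_uint16(gene)
    stored_counts[wrapped] = stored_counts.get(wrapped, 0) + count

# Tally counts per species using the assumed human/mouse boundary.
human_after = sum(c for g, c in stored_counts.items() if g < N_HUMAN)
mouse_after = sum(c for g, c in stored_counts.items() if g >= N_HUMAN)

print(stored_counts)
print("human:", human_after, "mouse:", mouse_after)
print("bits needed for", N_GENES, "genes:", min_index_bits(N_GENES))
```

In this toy layout, the two mouse genes with indices above 65535 are reassigned to human gene indices, while the one below the limit is unaffected. Choosing the index width from the actual number of features, as `min_index_bits` does, avoids the wraparound.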
