Subsample by observations grouping

Additional function parameters / changed functionality / changed defaults?
New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
New plotting function: A kind of plot you would like to see in sc.pl?
External tools: Do you know an existing package that should go into sc.external.*?
Other?

Related to scanpy.pp.subsample, it would be useful to have a subsampling tool that subsamples based on the key of an observations grouping. E.g., if I have an observation key ‘MyGroup’ with possible values [‘A’, ‘B’], there are 10,000 cells of type ‘A’ and 2,000 cells of type ‘B’, and I want at most 5,000 cells of each type, then this function would subsample 5,000 cells of type ‘A’ but retain all 2,000 cells of type ‘B’.
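
To make the requested behaviour concrete, here is a minimal sketch using only the Python standard library. It is not scanpy's API: the function name `capped_group_indices` and its parameters are hypothetical, chosen for illustration. It works on a plain sequence of group labels and returns the positions to keep, so that no group contributes more than the cap.

```python
import random
from collections import defaultdict


def capped_group_indices(labels, max_per_group, seed=0):
    """Return sorted positions to keep, with at most max_per_group per label.

    Hypothetical helper for illustration; not part of scanpy.
    """
    rng = random.Random(seed)

    # Collect the positions belonging to each group label.
    positions = defaultdict(list)
    for i, label in enumerate(labels):
        positions[label].append(i)

    keep = []
    for idx in positions.values():
        # Only groups larger than the cap are subsampled (without replacement).
        if len(idx) > max_per_group:
            idx = rng.sample(idx, max_per_group)
        keep.extend(idx)
    return sorted(keep)


# The example from the request: 10,000 'A' cells, 2,000 'B' cells, cap of 5,000.
labels = ["A"] * 10_000 + ["B"] * 2_000
keep = capped_group_indices(labels, max_per_group=5_000)
kept_labels = [labels[i] for i in keep]
print(kept_labels.count("A"), kept_labels.count("B"))  # prints: 5000 2000
```

With an AnnData object, one could presumably pass `adata.obs['MyGroup']` as `labels` and then take `adata[keep].copy()`, since the returned values are integer positions along the observations axis.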

Issue Analytics

State:
Created 4 years ago
Reactions: 1
Comments: 10 (8 by maintainers)

Top GitHub Comments

1 reaction

giovp commented, Nov 18, 2021

I’ll reopen this because I think it’s still quite relevant and could be very straightforward to implement with sklearn’s resample

also, there is an entire package for subsampling strategies which is probably quite relevant: https://github.com/scikit-learn-contrib/imbalanced-learn

line here for reference: https://github.com/theislab/scanpy/blob/48cc7b38f1f31a78902a892041902cc810ddfcd3/scanpy/preprocessing/_simple.py#L857

1 reaction

LuckyMD commented, Jan 14, 2020

Something like this should work. Note, this is not tested.

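import anndata as ad
import scanpy as sc

target_cells = 5000
cluster_key = 'MyGroup'  # placeholder: the adata.obs column to group by (must be categorical)

# One AnnData per group. Compare with == (isin() needs a list, not a single label)
# and copy so that each piece is a real object rather than a view of adata.
adatas = [adata[adata.obs[cluster_key] == clust].copy() for clust in adata.obs[cluster_key].cat.categories]

for dat in adatas:
    if dat.n_obs > target_cells:
        sc.pp.subsample(dat, n_obs=target_cells)  # subsamples dat in place

# anndata.concat replaces the deprecated AnnData.concatenate; merge="same" keeps
# var annotations that are identical across all pieces.
adata_downsampled = ad.concat(adatas, merge="same")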

Hope that helps.