Subsample by observations grouping

  • Additional function parameters / changed functionality / changed defaults?
  • New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
  • New plotting function: A kind of plot you would like to see in sc.pl?
  • External tools: Do you know an existing package that should go into sc.external.*?
  • Other?

Related to scanpy.pp.subsample, it would be useful to have a subsampling tool that subsamples based on the key of an observations grouping. For example, if I have an observation key ‘MyGroup’ with possible values [‘A’, ‘B’], with 10,000 cells of type ‘A’ and 2,000 cells of type ‘B’, and I want at most 5,000 cells of each type, then this function would subsample type ‘A’ down to 5,000 cells but retain all 2,000 cells of type ‘B’.
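
The requested behaviour can be illustrated on a plain pandas table standing in for adata.obs. This is only a sketch of the idea, not an existing scanpy function: the helper name capped_group_sample, its parameters, and the synthetic cell names are all made up for illustration.

```python
import numpy as np
import pandas as pd

def capped_group_sample(obs, key, max_per_group, seed=0):
    """Hypothetical helper: row labels with at most `max_per_group` rows per value of obs[key]."""
    rng = np.random.default_rng(seed)
    keep = []
    # .groups maps each group value to the index labels of its rows
    for _, idx in obs.groupby(key, observed=True).groups.items():
        idx = np.asarray(idx)
        if len(idx) > max_per_group:
            # Only groups above the cap are subsampled (without replacement)
            idx = rng.choice(idx, size=max_per_group, replace=False)
        keep.extend(idx)
    return pd.Index(keep)

# Synthetic stand-in for adata.obs: 10,000 'A' cells and 2,000 'B' cells
obs = pd.DataFrame(
    {"MyGroup": pd.Categorical(["A"] * 10_000 + ["B"] * 2_000)},
    index=[f"cell{i}" for i in range(12_000)],
)
kept = capped_group_sample(obs, "MyGroup", 5_000)
counts = obs.loc[kept, "MyGroup"].value_counts()
```

With an actual AnnData object, the returned labels could presumably be used to select cells by name, e.g. adata[kept], assuming the labels match adata.obs_names.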

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

1 reaction
giovp commented, Nov 18, 2021

I’ll reopen this because I think it’s still quite relevant and could be very straightforward to implement with sklearn’s resample.

Also, there is an entire package for subsampling strategies, which is probably quite relevant: https://github.com/scikit-learn-contrib/imbalanced-learn

The relevant line, for reference: https://github.com/theislab/scanpy/blob/48cc7b38f1f31a78902a892041902cc810ddfcd3/scanpy/preprocessing/_simple.py#L857
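
A minimal sketch of what the resample-based approach might look like, working on an array of group labels and returning integer positions. The function name resample_capped and its arguments are hypothetical; only sklearn.utils.resample is a real library call, and this is not how scanpy implements anything.

```python
import numpy as np
from sklearn.utils import resample

def resample_capped(labels, max_per_group, seed=0):
    """Hypothetical helper: positions keeping at most `max_per_group` entries per label."""
    labels = np.asarray(labels)
    keep = []
    for group in np.unique(labels):
        idx = np.flatnonzero(labels == group)
        # Never request more samples than the group has, so small groups are kept whole
        n = min(len(idx), max_per_group)
        keep.append(resample(idx, replace=False, n_samples=n, random_state=seed))
    return np.sort(np.concatenate(keep))

# Synthetic labels: 10,000 'A' and 2,000 'B'
labels = np.array(["A"] * 10_000 + ["B"] * 2_000)
kept = resample_capped(labels, 5_000)
```

Sampling without replacement (replace=False) matters here: the default in resample is bootstrap sampling, which would duplicate cells.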

1 reaction
LuckyMD commented, Jan 14, 2020

Something like this should work. Note, this is not tested.

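import scanpy as sc

# Assumes `adata` is an existing AnnData object and `cluster_key` names a
# categorical column in adata.obs (e.g. 'MyGroup').
target_cells = 5000

# One independent copy per category; `==` selects a single category
# (`isin` expects a list-like, not a single value).
adatas = [adata[adata.obs[cluster_key] == clust].copy() for clust in adata.obs[cluster_key].cat.categories]

for dat in adatas:
    if dat.n_obs > target_cells:
        # Subsamples in place; smaller groups are left untouched
        sc.pp.subsample(dat, n_obs=target_cells)

adata_downsampled = adatas[0].concatenate(*adatas[1:])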

Hope that helps.
