Understanding normalization and log transformation
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of scanpy.
- (optional) I have confirmed this bug exists on the master branch of scanpy.
This is probably a bug in my thinking, but naively I thought that sc.pp.normalize_total() normalizes counts per cell, thus allowing comparison of different cells by correcting for variable sequencing depth. However, the log transformation applied after normalisation seems to upset this relationship (example below). Why is this not problematic?
Incidentally, I first noticed this on my real biological dataset, not the toy example below.
Edit: relevant paper
We can show, mathematically, that if we normalize expression profiles to have the same mean across cells, the mean after the equation [log] transformation used for RNA-Seq data will not be the same, and it will depend on the detection rate…
And this one:
One issue of particular interest is that the mean of the log-counts is not generally the same as the log-mean count [1]. This is problematic in scRNA-seq contexts where the log-transformation is applied to normalized expression data.
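A minimal sketch of the effect both quotes describe, using made-up values rather than real data: two hypothetical cells share the same mean after normalization but differ in how many genes are detected, and the mean after log1p no longer matches. The names cell_dense and cell_sparse are invented for illustration.

```python
import numpy as np

# Two hypothetical cells with 10 genes each, already normalized to the same mean (1.0).
# cell_dense detects every gene; cell_sparse detects only 2 of its 10 genes.
cell_dense = np.ones(10)
cell_sparse = np.array([5.0, 5.0, 0, 0, 0, 0, 0, 0, 0, 0])

mean_dense, mean_sparse = cell_dense.mean(), cell_sparse.mean()
mean_log_dense = np.log1p(cell_dense).mean()
mean_log_sparse = np.log1p(cell_sparse).mean()

print(mean_dense, mean_sparse)          # identical means before the transform
print(mean_log_dense, mean_log_sparse)  # the means differ after log1p
```

Because log1p is concave, spreading the same total over more detected genes gives a larger mean on the log scale, which is why the result depends on the detection rate.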
Minimal code sample
>>> from anndata import AnnData
>>> import scanpy as sc
>>> import numpy as np
>>> adata = AnnData(np.array([[3, 3, 3, 6, 6],[1, 1, 1, 2, 2],[1, 22, 1, 2, 2], ]))
>>> X_norm = sc.pp.normalize_total(adata, target_sum=1, inplace=False)['X']
>>> X_norm_log = np.log1p(X_norm)
>>> X_norm_again = np.expm1(X_norm_log)
>>> adata.X.sum(axis=1)
array([21., 7., 28.], dtype=float32) # Different counts for each cell
>>> X_norm.sum(axis=1)
array([1., 1., 1.], dtype=float32) # Normalisation means same counts for each cell
>>> X_norm_log.sum(axis=1)
array([0.90322304, 0.90322304, 0.7879869 ], dtype=float32) # <<< Interested in this! Different counts for each cell
>>> X_norm_again.sum(axis=1)
array([1., 1., 1.], dtype=float32) # Counts the same again
Versions
I’m not using the latest scanpy and anndata versions, but I don’t think this will be different on the master branch.
sc.logging.print_versions()
scanpy==1.4.5.post2 anndata==0.6.22.post1 umap==0.3.10 numpy==1.18.1 scipy==1.2.1 pandas==1.0.1 scikit-learn==0.22.1 statsmodels==0.11.0 python-igraph==0.8.0
Hey @chris-rands,
This is a really interesting topic. Sorry in advance for the wordy reply… You are absolutely correct that log transformation removes the perfect comparison of relative expression values that mean normalization provides. Aside from CPM normalization (as provided by sc.pp.normalize_total()) not being a good normalization technique anyway (this is argued in any paper on more advanced normalization methods, e.g., the scran pooling paper), there are a couple of things to consider here.
First, relative gene expression values ignore differences in cell size and in the number of molecules in the cell. There are some molecules whose numbers scale with the size of the cell, and others that don’t (e.g., many housekeeping genes). Choosing relative over absolute expression values to compare gene expression across cells helps for the genes that scale with size, but not for the others, so there’s not really a perfect answer here. Thus, removing all effects of total counts may not be the desirable outcome.
Secondly, many downstream methods assume normally distributed expression data (e.g., DE methods such as t-tests, limma, and MAST, or several batch correction/data integration methods). Log transformation is used as a variance-stabilizing step to approximate a normal distribution (quite often poorly, but better than without it). As a result, many methods perform better on log-transformed data.
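A rough sketch of the variance-stabilization point, on simulated counts rather than real data. It assumes negative binomial counts with a made-up size parameter (dispersion_n) and three made-up mean levels, then compares how much the variance changes across those levels before and after log1p.

```python
import numpy as np

rng = np.random.default_rng(0)
dispersion_n = 2.0              # hypothetical NB size parameter: variance = mu + mu**2 / n
means = [10.0, 100.0, 1000.0]   # hypothetical expression levels

raw_var, log_var = [], []
for mu in means:
    # numpy parameterizes the NB by (n, p); this choice of p gives mean mu
    p = dispersion_n / (dispersion_n + mu)
    counts = rng.negative_binomial(dispersion_n, p, size=20_000)
    raw_var.append(counts.var())
    log_var.append(np.log1p(counts).var())

# Ratio of the largest to the smallest variance across the mean levels
raw_spread = max(raw_var) / min(raw_var)
log_spread = max(log_var) / min(log_var)
print(raw_spread, log_spread)
```

On the raw scale the variance grows strongly with the mean, while after log1p it stays within a much narrower range. This only illustrates the stabilizing effect under the assumed count model; it does not show that the transformed values are normally distributed.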
IMO, the ideal approach is probably something like scVI, GLMPCA, or scTransform, where you fit a model directly to the count data and use the residuals to describe the data. This would address both steps of normalization and variance stabilization at the same time. If we have a good model to describe the data, the residuals should quantify the biological variance + normally distributed noise.
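To make the "fit a model and use the residuals" idea concrete, here is a much-simplified sketch. It is not scVI, GLMPCA, or scTransform: it assumes a null model in which each expected count is (cell total × gene total) / grand total, with a negative binomial variance and a fixed, made-up overdispersion theta. The function name pearson_residuals is hypothetical, and the matrix is the toy example from the question.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Pearson residuals under a simple NB null model (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=1, keepdims=True)   # per-cell depth
    gene_totals = counts.sum(axis=0, keepdims=True)   # per-gene total
    mu = cell_totals * gene_totals / counts.sum()     # expected count per entry
    # NB variance: mu + mu^2 / theta (theta is an assumed, fixed overdispersion)
    return (counts - mu) / np.sqrt(mu + mu**2 / theta)

# Toy matrix from the question: rows are cells, columns are genes
counts = np.array([[3, 3, 3, 6, 6], [1, 1, 1, 2, 2], [1, 22, 1, 2, 2]])
residuals = pearson_residuals(counts)
print(residuals.round(2))
```

In this sketch sequencing depth enters through the expected counts, and dividing by the model standard deviation takes the place of the log transform, so normalization and variance stabilization happen in one step.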
Overall, I would use a normalization approach other than CPM, and apply log transformation with any method that uses size factors to scale per-cell expression values.
Note also that the effect described in the second paper you mention (from Aaron Lun) will mainly be relevant when you have biased distributions of sequencing depth between two samples that you are comparing. If the size factors are similarly distributed between both conditions, then the DE effect will not be so dramatic (as far as I understood it anyway).
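A small simulation of this point, under strong simplifying assumptions: a single gene with the same true proportion in both conditions, Poisson sampling, one fixed depth per condition, and size factors equal to the true depth. All values and the helper mean_log_expr are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 5_000
gene_fraction = 0.001   # same true proportion of the gene in both conditions
target_sum = 10_000     # hypothetical scaling target

def mean_log_expr(depth):
    """Mean log1p-normalized expression for cells sequenced at a given depth."""
    counts = rng.poisson(depth * gene_fraction, size=n_cells)
    normalized = counts / depth * target_sum   # scale by the known depth
    return np.log1p(normalized).mean()

# Same depth in both conditions vs. a 10-fold difference in depth
diff_matched = mean_log_expr(1_000) - mean_log_expr(1_000)
diff_biased = mean_log_expr(10_000) - mean_log_expr(1_000)
print(diff_matched, diff_biased)
```

With matched depths the difference in mean log expression stays close to zero, whereas the depth-biased comparison shows a clear shift even though the gene is not differentially expressed.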
An example with real data:
Code: