Understanding normalization and log transformation

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of scanpy.
(optional) I have confirmed this bug exists on the master branch of scanpy.

This is probably a bug in my thinking, but naively I thought that sc.pp.normalize_total() normalizes counts per cell, thus allowing comparison of different cells by correcting for variable sequencing depth. However, the log transformation applied after normalisation seems to upset this relationship (see the example below). Why is this not problematic?

Incidentally, I first noticed this on my real biological dataset, not the toy example below.

Edit: relevant paper

We can show, mathematically, that if we normalize expression profiles to have the same mean across cells, the mean after the equation [log] transformation used for RNA-Seq data will not be the same, and it will depend on the detection rate…

And this one:

One issue of particular interest is that the mean of the log-counts is not generally the same as the log-mean count [1]. This is problematic in scRNA-seq contexts where the log-transformation is applied to normalized expression data.

Minimal code sample

>>> from anndata import AnnData
>>> import scanpy as sc
>>> import numpy as np
>>> adata = AnnData(np.array([[3, 3, 3, 6, 6],[1, 1, 1, 2, 2],[1, 22, 1, 2, 2], ]))
>>> X_norm = sc.pp.normalize_total(adata, target_sum=1, inplace=False)['X']
>>> X_norm_log = np.log1p(X_norm)
>>> X_norm_again = np.expm1(X_norm_log)
>>> adata.X.sum(axis=1)
array([21.,  7., 28.], dtype=float32)  # Different counts for each cell
>>> X_norm.sum(axis=1)
array([1., 1., 1.], dtype=float32)  # Normalisation means same counts for each cell
>>> X_norm_log.sum(axis=1)
array([0.90322304, 0.90322304, 0.7879869 ], dtype=float32)  # <<< Interested in this! Different counts for each cell
>>> X_norm_again.sum(axis=1)
array([1., 1., 1.], dtype=float32)  # Counts the same again
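
A minimal NumPy-only sketch of why the sums diverge, assuming nothing beyond the toy data above: log1p is concave, so after every row is scaled to the same total, the sum of log1p values depends on how evenly the counts are spread across genes. The helper name log1p_row_sums and the two extra profiles (perfectly even, and all counts in one gene) are illustrative additions, not part of the original report.

```python
import numpy as np

def log1p_row_sums(counts):
    """Scale each row to sum to 1, then return the per-row sum of log1p values."""
    counts = np.asarray(counts, dtype=float)
    normalized = counts / counts.sum(axis=1, keepdims=True)
    return np.log1p(normalized).sum(axis=1)

profiles = np.array([
    [4, 4, 4, 4, 4],    # hypothetical: perfectly even across genes
    [3, 3, 3, 6, 6],    # first cell of the toy example
    [1, 22, 1, 2, 2],   # third cell of the toy example (one dominant gene)
    [20, 0, 0, 0, 0],   # hypothetical: all counts in a single gene
])

sums = log1p_row_sums(profiles)
print(sums)  # the more uneven the profile, the smaller the sum
```

The first two cells of the toy example share the same relative profile, which is why they give identical sums; only the third cell, with one dominant gene, drops.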

Versions

I’m not using the latest scanpy and anndata versions, but I don’t think this will be different on the master branch.

sc.logging.print_versions()
scanpy==1.4.5.post2 anndata==0.6.22.post1 umap==0.3.10 numpy==1.18.1 scipy==1.2.1 pandas==1.0.1 scikit-learn==0.22.1 statsmodels==0.11.0 python-igraph==0.8.0

Top GitHub Comments

LuckyMD commented, Aug 21, 2020

Hey @chris-rands,

This is a really interesting topic. Sorry in advance for the wordy reply… You are absolutely correct that log transformation removes the perfect comparison of relative expression values that mean normalization provides. Aside from CPM normalization (as provided by sc.pp.normalize_total()) not being a good normalization technique anyway (as argued in most papers on more advanced normalization methods, e.g., the scran pooling paper), there are a couple of things to consider here:

Do we even want relative expression counts?
What assumptions do downstream methods make about the distribution of expression values?

For the first question: relative gene expression values ignore differences in cell sizes/number of molecules in the cell. There are some molecules whose numbers scale with the size of the cell, and others that don’t (e.g., many housekeeping genes). Choosing relative over absolute expression values to compare gene expression across cells would be helpful to compare expression of those genes that scale with size, but not the others… so there’s not really a perfect answer here. Thus, removing all effects of total counts may not be the desirable outcome.

Secondly, many downstream methods assume normally distributed expression data (e.g., DE methods such as t-tests, limma, and MAST, or several batch correction/data integration methods). Log transformation is used as a variance stabilization to approximate a normal distribution (quite often poorly, but better than without). As a result, many methods perform better on log-transformed data.
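
As a rough illustration of the "poorly, but better than without" point, the sketch below compares the skewness of simulated counts before and after log1p. The negative binomial parameters and the skewness helper are assumptions chosen for illustration; real scRNA-seq data will behave differently, especially for lowly expressed genes with many zeros.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment divided by variance**1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

rng = np.random.default_rng(0)

# Hypothetical overdispersed counts for a single gene across 10,000 cells.
counts = rng.negative_binomial(n=2, p=0.02, size=10_000)

skew_raw = skewness(counts)
skew_log = skewness(np.log1p(counts))
print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

Under these assumptions the log-transformed values are less skewed in magnitude than the raw counts, though still not normally distributed.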

IMO, the ideal approach is probably something like scVI, GLMPCA, or scTransform, where you fit a model directly to the count data and use the residuals to describe the data. This would address both steps of normalization and variance stabilization at the same time. If we have a good model to describe the data, the residuals should quantify the biological variance + normally distributed noise.

Overall, I would use other normalization approaches than CPM, and use log-transformation with anything that uses size factors that scale per-cell expression values.

Note also that the effect described in the second paper you mention (from Aaron Lun) will mainly be relevant when you have biased distributions of sequencing depth between two samples that you are comparing. If the size factors are similarly distributed between both conditions, then the DE effect will not be so dramatic (as far as I understood it anyway).
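
The depth dependence can be sketched with a small simulation. This is only an illustration under stated assumptions: one gene with the same true relative expression in two groups of cells, Poisson sampling, and normalization by the nominal depth rather than by each cell's observed total. The function name and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 20_000
gene_fraction = 1e-3   # assumed: identical true relative expression in both groups
target_sum = 1e4       # CPM-like scaling target

def simulate_group(depth):
    """Return (mean normalized value, mean log1p-normalized value) for one gene."""
    counts = rng.poisson(depth * gene_fraction, size=n_cells)
    normalized = counts / depth * target_sum
    return normalized.mean(), np.log1p(normalized).mean()

mean_shallow, logmean_shallow = simulate_group(depth=1_000)
mean_deep, logmean_deep = simulate_group(depth=10_000)

print(f"mean normalized:  shallow={mean_shallow:.2f}, deep={mean_deep:.2f}")
print(f"mean log1p value: shallow={logmean_shallow:.2f}, deep={logmean_deep:.2f}")
```

The normalized means agree between the two groups, but the means of the log-transformed values do not, so a comparison on the log scale would see a difference driven purely by depth. If both groups had similar depth distributions, this gap would largely disappear.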

chris-rands commented, Aug 14, 2020

An example with real data:

[Figure: counts]

Code:

import os

import scanpy as sc

# NOTE: `save_path` must already point to the directory holding the
# extracted PBMC 3k download; it is not defined in this snippet.

# Load the PBMC 3k data
adata = sc.read_10x_mtx(
    os.path.join(
        save_path, "filtered_gene_bc_matrices/hg19/"
    ),  # the directory with the `.mtx` file
    var_names="gene_symbols",  # use gene symbols for the variable names (variables-axis index)
)
adata.var_names_make_unique()

# Record total counts per cell: raw, after normalization, and after log1p
adata.obs["n_counts"] = adata.X.sum(axis=1).A1
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
adata.obs["n_counts_normalized"] = adata.X.sum(axis=1).A1
sc.pp.log1p(adata)
adata.obs["n_counts_normalized_log"] = adata.X.sum(axis=1).A1

# Dim reduction, then color the UMAP by each of the three totals
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["n_counts", "n_counts_normalized", "n_counts_normalized_log"])
