ENH: implement pd.Series.corr(method="distance")
Distance correlation (https://en.wikipedia.org/wiki/Distance_correlation) is a powerful yet underused technique for measuring the dependence between two variables, and I think it would make a very nice addition to the existing correlation methods in pandas. For one, it has the property that two random variables $X$ and $Y$ are independent if and only if their distance correlation is zero, which cannot be said of Pearson, Spearman or Kendall.
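For reference, the sample statistic usually meant by "distance correlation" can be written as follows. The notation is only a sketch chosen to mirror the variable names in the implementation in this issue ($a$, $b$ for pairwise distances, $A$, $B$ for their double-centered versions); it is not taken from pandas.

$$
\begin{aligned}
a_{jk} &= |x_j - x_k|, \qquad b_{jk} = |y_j - y_k|, \\
A_{jk} &= a_{jk} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}, \qquad
B_{jk} = b_{jk} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot}, \\
\operatorname{dCov}_n^2(x, y) &= \frac{1}{n^2} \sum_{j,k=1}^{n} A_{jk} B_{jk}, \qquad
\operatorname{dVar}_n^2(x) = \operatorname{dCov}_n^2(x, x), \\
\operatorname{dCor}_n(x, y) &= \frac{\operatorname{dCov}_n(x, y)}{\sqrt{\operatorname{dVar}_n(x)\,\operatorname{dVar}_n(y)}}.
\end{aligned}
$$

Here $\bar{a}_{j\cdot}$, $\bar{a}_{\cdot k}$ and $\bar{a}_{\cdot\cdot}$ are the row mean, column mean and grand mean of the distance matrix.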
The code below is an implementation in pure numpy (which could certainly be optimized or written more elegantly) that could be part of the Series class and then called within corr. Later it could be integrated seamlessly with corrwith. If this feature were available, I know it would personally be one of the first things I would look at when approaching a regression problem.
# self and other can be assumed to be aligned already
def nandistcorr(self, other):
    n = len(self)
    a = np.zeros(shape=(n, n))
    b = np.zeros(shape=(n, n))
    for i in range(n):
        for j in range(i+1, n):
            a[i, j] = abs(self[i] - self[j])
            b[i, j] = abs(other[i] - other[j])
    a = a + a.T
    b = b + b.T
    a_bar = np.vstack([np.nanmean(a, axis=0)] * n)
    b_bar = np.vstack([np.nanmean(b, axis=0)] * n)
    A = a - a_bar - a_bar.T + np.full(shape=(n, n), fill_value=a_bar.mean())
    B = b - b_bar - b_bar.T + np.full(shape=(n, n), fill_value=b_bar.mean())
    cov_ab = np.sqrt(np.nansum(A * B)) / n
    std_a = np.sqrt(np.sqrt(np.nansum(A**2)) / n)
    std_b = np.sqrt(np.sqrt(np.nansum(B**2)) / n)
    return cov_ab / std_a / std_b
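Since the loop-based version is described as something that could be optimized, here is one possible vectorized sketch using numpy broadcasting. The name distcorr_vectorized is hypothetical, and two behaviours are my assumptions rather than part of the proposal: pairs containing a NaN are dropped up front, and the result is 0.0 when either input is constant. It needs O(n^2) memory for the distance matrices, so it only suits moderately sized inputs.

```python
import numpy as np


def distcorr_vectorized(x, y):
    """Sample distance correlation of two 1-D arrays (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")

    # Assumption: drop every pair in which either value is NaN.
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]
    if x.size < 2:
        return np.nan

    # Pairwise absolute differences via broadcasting (n x n matrices).
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double centering: subtract row and column means, add the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    dcov2 = (A * B).mean()    # squared sample distance covariance
    dvar2_x = (A * A).mean()  # squared sample distance variance of x
    dvar2_y = (B * B).mean()  # squared sample distance variance of y

    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        # Assumption: treat a constant input as having zero correlation.
        return 0.0
    # Guard against tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(dcov2, 0.0) / denom))
```

For NaN-free input this computes the same quantity as nandistcorr, just without the Python-level double loop.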
Here’s an example that shows how distance correlation can detect relationships that the other common correlation methods miss:
import numpy as np
import pandas as pd
np.random.seed(2357)
s1 = pd.Series(np.random.randn(1000))
s2 = s1**2
s1.corr(s2, method="pearson")
s1.corr(s2, method="spearman")
s1.corr(s2, method="kendall")
nandistcorr(s1.values, s2.values)
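As a stopgap until something like method="distance" exists, recent pandas versions accept a callable for method in Series.corr and DataFrame.corr; the callable receives two 1-D ndarrays and must return a float. The sketch below relies on that. The helper name distcorr and the data are illustrative assumptions, and the helper assumes NaN-free input of equal length.

```python
import numpy as np
import pandas as pd


def distcorr(x, y):
    """Sample distance correlation of two NaN-free 1-D arrays (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute differences, then double centering.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0:
        return 0.0  # assumption: constant input gives zero correlation
    return float(np.sqrt(max((A * B).mean(), 0.0) / denom))


rng = np.random.default_rng(2357)
s1 = pd.Series(rng.standard_normal(500))
s2 = s1 ** 2

# Built-in method for comparison, then the callable.
pearson = s1.corr(s2, method="pearson")
dist = s1.corr(s2, method=distcorr)

# The same callable also works for a full correlation matrix.
df = pd.DataFrame({"x": s1, "x_squared": s2})
dist_matrix = df.corr(method=distcorr)
```

This does not replace a native implementation (it is neither fast nor memory-efficient), but it makes it easy to try distance correlation alongside the existing methods.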
Issue Analytics
- Created 5 years ago
- Comments: 8 (7 by maintainers)
Top GitHub Comments
There is the pandas mailing list.
Why closed?