ENH: implement pd.Series.corr(method="distance")

Distance correlation (https://en.wikipedia.org/wiki/Distance_correlation) is a powerful yet underused measure of dependence between two random variables, and I think it would make a very nice addition to the existing correlation methods in pandas. For one, it has the property that two random variables $X$ and $Y$ (with finite first moments) are independent if and only if their distance correlation is zero, which cannot be said of Pearson, Spearman or Kendall.
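
For reference, the sample statistic can be written out as follows. This follows the standard definition in the Wikipedia article linked above; the symbols $a_{jk}$, $A_{jk}$, $\mathrm{dCov}_n$ and $\mathrm{dVar}_n$ are notation introduced here, not part of the original proposal.

$$a_{jk} = |x_j - x_k|, \qquad b_{jk} = |y_j - y_k|, \qquad j, k = 1, \dots, n$$

$$A_{jk} = a_{jk} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}, \qquad B_{jk} = b_{jk} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot}$$

$$\mathrm{dCov}_n^2(X, Y) = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{jk} B_{jk}, \qquad \mathrm{dVar}_n^2(X) = \mathrm{dCov}_n^2(X, X)$$

$$\mathrm{dCor}_n(X, Y) = \frac{\mathrm{dCov}_n(X, Y)}{\sqrt{\mathrm{dVar}_n(X)\,\mathrm{dVar}_n(Y)}}$$

Here $\bar{a}_{j\cdot}$ and $\bar{a}_{\cdot k}$ are the row and column means of the distance matrix and $\bar{a}_{\cdot\cdot}$ is its grand mean.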

The code below is a pure numpy implementation (which could certainly be optimized or written more elegantly) that could be part of the Series class and called from within corr. Later it could be integrated with corrwith. If this feature were available, it would personally be one of the first things I would look at when approaching a regression problem.

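import numpy as np

# self and other can be assumed to be aligned already
# (1-D numpy arrays of equal length, indexable by position)
def nandistcorr(self, other):
    n = len(self)
    a = np.zeros(shape=(n, n))
    b = np.zeros(shape=(n, n))

    # fill the upper triangle with pairwise absolute differences
    for i in range(n):
        for j in range(i+1, n):
            a[i, j] = abs(self[i] - self[j])
            b[i, j] = abs(other[i] - other[j])

    # mirror to get the full symmetric distance matrices
    a = a + a.T
    b = b + b.T

    # column means, repeated n times so each has shape (n, n)
    a_bar = np.vstack([np.nanmean(a, axis=0)] * n)
    b_bar = np.vstack([np.nanmean(b, axis=0)] * n)

    # double centering: subtract column and row means, add the grand mean
    A = a - a_bar - a_bar.T + np.full(shape=(n, n), fill_value=a_bar.mean())
    B = b - b_bar - b_bar.T + np.full(shape=(n, n), fill_value=b_bar.mean())

    # distance covariance and square roots of the distance standard deviations
    cov_ab = np.sqrt(np.nansum(A * B)) / n
    std_a = np.sqrt(np.sqrt(np.nansum(A**2)) / n)
    std_b = np.sqrt(np.sqrt(np.nansum(B**2)) / n)

    return cov_ab / std_a / std_b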
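
As one possible optimization of the double loop above, the pairwise distances can be built with numpy broadcasting. This is only a sketch: the name `distcorr_vectorized` is made up for illustration, and it assumes both inputs are 1-D, equal-length and free of NaN (it does no NaN handling of its own).

```python
import numpy as np


def distcorr_vectorized(x, y):
    """Sample distance correlation of two 1-D arrays (illustrative sketch).

    Assumes x and y have the same length and contain no NaN values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of the same length")

    # Pairwise absolute differences via broadcasting, shape (n, n).
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double centering: subtract column and row means, add the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    dcov2 = (A * B).mean()   # squared sample distance covariance
    dvar2_x = (A * A).mean()  # squared sample distance variance of x
    dvar2_y = (B * B).mean()  # squared sample distance variance of y

    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        # One of the inputs is constant; return 0 by convention.
        return 0.0
    # Clip tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(dcov2, 0.0) / denom))
```

Like the loop version, this still allocates two n-by-n matrices, so memory grows quadratically with the length of the series.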

Here’s an example that shows how distance correlation can detect relationships that the other common correlation methods miss:

import numpy as np
import pandas as pd
np.random.seed(2357)

s1 = pd.Series(np.random.randn(1000))
s2 = s1**2

s1.corr(s2, method="pearson")
s1.corr(s2, method="spearman")
s1.corr(s2, method="kendall")
nandistcorr(s1.values, s2.values)
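
Until something like method="distance" exists, a similar call shape is available because Series.corr also accepts a callable that takes two 1-D ndarrays and returns a float. The sketch below assumes a pandas version with that support (0.24 or later); `distance_correlation` is a hypothetical helper written for this example, not a pandas function. To my understanding pandas aligns the two Series and drops missing pairs before calling it, so the helper itself does not handle NaN.

```python
import numpy as np
import pandas as pd


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D arrays (hypothetical helper)."""
    def centered(v):
        # Pairwise distance matrix, then double centering.
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

    A = centered(np.asarray(x, dtype=float))
    B = centered(np.asarray(y, dtype=float))
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0:
        return 0.0  # a constant input carries no dependence information
    return float(np.sqrt(max((A * B).mean(), 0.0) / denom))


rng = np.random.default_rng(2357)
s1 = pd.Series(rng.standard_normal(1000))
s2 = s1 ** 2

# Built-in method versus the callable passed through the same API.
pearson = s1.corr(s2, method="pearson")
distance = s1.corr(s2, method=distance_correlation)
print(f"pearson:  {pearson:.3f}")
print(f"distance: {distance:.3f}")
```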

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

gfyoung commented, Aug 19, 2018 (1 reaction)

There is the pandas mailing list.

nickcorona commented, Jul 28, 2020 (0 reactions)

Why closed?
