BUG: scipy.stats.theilslopes intercept calculation can produce incorrect results.


theilslopes computes the intercept from the estimated median slope and the medians of x and y, as np.median(y) - medslope * np.median(x). This can produce an incorrect intercept when there is a gap in the middle of the x values. Most other implementations of this algorithm use the median of all per-point intercepts, np.median(y - medslope * x), which doesn’t seem to have this issue.

The np.median(y) - medslope * np.median(x) formula is in the documentation, so I’m not entirely sure whether there’s a good reason for calculating the intercept that way.
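To make the difference between the two formulas concrete, here is a minimal sketch in plain Python (no SciPy). The helper names `theil_sen_slope` and `both_intercepts` and the toy data are made up for illustration; this is not SciPy's implementation, only the two formulas quoted above applied to x values with a gap in the middle.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of the slopes over all point pairs with distinct x values."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return median(slopes)


def both_intercepts(x, y):
    """Return (slope, separate_intercept, joint_intercept)."""
    m = theil_sen_slope(x, y)
    # median(y) - medslope * median(x): the medians are taken separately
    separate = median(y) - m * median(x)
    # median(y - medslope * x): median of the per-point intercepts
    joint = median(yi - m * xi for xi, yi in zip(x, y))
    return m, separate, joint


# Made-up data (an assumption, not the issue's data): two clusters of x
# values separated by a gap.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
y = [1.0, 3.0, 2.0, 8.0, 5.0, 9.0, 7.0]

m, separate, joint = both_intercepts(x, y)
print(f"slope={m:.4f} separate={separate:.4f} joint={joint:.4f}")
# prints: slope=0.5000 separate=0.0000 joint=0.5000
```

On this toy data the two intercepts differ by 0.5 even though the slope is the same. It only shows that the formulas can disagree; it says nothing about which one is preferable.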

Reproducing code example:

import pandas as pd
from scipy.stats import theilslopes
from io import StringIO
import matplotlib.pyplot as plt

test_data = StringIO("""
x,y
1,1
1.000507357,1.012345679
1.001014713,1.802469136
1.00152207,1.469135802
1.002029427,1.740740741
1.002536783,1.395061728
1.00304414,1.271604938
1.017250127,2.24691358
1.017757484,1.666666667
1.01826484,3.716049383
1.018772197,4.098765432
1.019279554,3.617283951
1.01978691,2.703703704
1.020294267,2.012345679
1.020801624,1.481481481
1.02130898,1.432098765
1.021816337,1.839506173
1.022323694,2.839506173
1.02283105,2.790123457
1.023338407,2.543209877
1.023845764,3.037037037
1.02435312,3.938271605
1.024860477,3.814814815
""")

df = pd.read_csv(test_data)

slopes = theilslopes(df.y, df.x)

df['line'] = df.x * slopes[0] + slopes[1]
df['line_corr'] = df.x * slopes[0] + (df.y - slopes[0] * df.x).median()

df.plot(x='x', y=['y', 'line', 'line_corr'], label=['Data', 'Theilslope intercept', 'Corrected Intercept'])
plt.show()


Scipy/Numpy/Python version information:

import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info)
# 1.7.0 1.21.0 1.2.5 sys.version_info(major=3, minor=9, micro=5, releaselevel='final', serial=0)

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

chrisb83 commented, Jul 13, 2021 (1 reaction)

@atrevisan21 thanks for the additional research on this topic.

I think the best approach would be to add a new keyword method like in stats.siegelslopes:

method='...': compute intercept as median(y) - medslope*median(x) (default option to keep backwards compatibility)
method='...': compute intercept as median(y - medslope*x)

Not sure what a good name for the arguments would be, maybe ‘separate’ in the first case since the medians are computed separately and ‘joint’ in the second case?
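A rough sketch of what such a keyword could look like, written as a standalone helper rather than a change to SciPy. The function name `intercept_from_slope` is hypothetical, and the option names 'separate' and 'joint' are simply the ones floated above.

```python
from statistics import median


def intercept_from_slope(x, y, medslope, method="separate"):
    """Hypothetical helper: pick the intercept formula by keyword.

    'separate' keeps the existing behaviour as the default;
    'joint' takes the median of the per-point intercepts.
    """
    if method == "separate":
        return median(y) - medslope * median(x)
    if method == "joint":
        return median(yi - medslope * xi for xi, yi in zip(x, y))
    raise ValueError("method must be either 'separate' or 'joint'")
```

Defaulting to 'separate' means existing callers would see no change, which is the backwards-compatibility point made above.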

atrevisan21 commented, Jul 11, 2021 (1 reaction)

It seems like there are two ways to calculate the intercept: np.median(y) - np.median(x) * m and np.median(y - x * m) (Helsel and others, 2019, p. 268). The main benefit of computing the intercept from the median of x and the median of y is that the line passes through the point (median x, median y), mirroring how a least squares regression line passes through (mean x, mean y).
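The "passes through the median point" property can be checked with a small sketch. The data and the slope value `m = 0.5` are assumptions chosen for illustration (in practice the slope would come from the Theil-Sen estimate), and `ols_fit` is a hand-written least squares helper, not a library call.

```python
import math
from statistics import mean, median


def ols_fit(x, y):
    """Ordinary least squares slope and intercept, computed by hand."""
    mx, my = mean(x), mean(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return slope, my - slope * mx


# Made-up data with a gap in x (an assumption for illustration).
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
y = [1.0, 3.0, 2.0, 8.0, 5.0, 9.0, 7.0]

# Least squares: the fitted line goes through (mean x, mean y).
ols_slope, ols_intercept = ols_fit(x, y)
ols_at_mean = ols_slope * mean(x) + ols_intercept
print(math.isclose(ols_at_mean, mean(y)))

# Median-based intercepts for a given slope m.
m = 0.5
b_separate = median(y) - m * median(x)
b_joint = median(yi - m * xi for xi, yi in zip(x, y))

# Only the 'separate' line is guaranteed to go through (median x, median y).
print(m * median(x) + b_separate, m * median(x) + b_joint, median(y))
```

For this data the first line reproduces median y at median x, while the median-of-intercepts line does not; that is the trade-off between the two conventions.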

There doesn’t seem to be a ‘right’ way to calculate the intercept, because the estimator focuses mainly on the slope, according to this Stack Exchange discussion. The choice of intercept calculation seems more circumstantial than definitive.

It looks like the USGS program KTRLine uses the median(y) - median(x) * slope approach (Conover, 1980) and so does the R package. It may be better to leave the intercept as is to be consistent with other software programs. The user can change the calculation if needed. Maybe a note could be added to the docstring.
