
Numerical instability with small feature values

See original GitHub issue

When feature values become small but are still well within floating-point range, random forests become numerically unstable.

The following code reproduces the problem. I use the Iris data and multiply all feature values by 1e-8, so that the values are smaller. I create a train/test split for this data and for the same data preprocessed with MinMaxScaler. I then train a random forest on each and print the feature importances and accuracy.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
X = iris.data[:, :4]
X *= 1e-8
assert np.isfinite(X).all()
assert not np.any(np.isnan(X))

X_scaled = MinMaxScaler().fit_transform(X)
assert np.isfinite(X_scaled).all()
assert not np.any(np.isnan(X_scaled))

y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_scaled, y, test_size=0.33, random_state=7)

# Unscaled (tiny) feature values
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
clf = clf.fit(X_train, y_train)
print(clf.feature_importances_)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# MinMax-scaled feature values
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
clf = clf.fit(X_train_s, y_train_s)
print(clf.feature_importances_)
y_pred_s = clf.predict(X_test_s)
accuracy = accuracy_score(y_test_s, y_pred_s)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

When training on the data where all features are scaled by 1e-8, the random forest does not learn: it gives every feature an importance of 0 and the accuracy is close to random guessing (28%). Using MinMaxScaler I get an accuracy of 92%, similar to the accuracy when simply training on the raw Iris data.

The printed output is:

[ 0.  0.  0.  0.]
Accuracy: 28.00%
[ 0.10529244  0.03109856  0.45044076  0.41316824]
Accuracy: 92.00%

To me this looks like a numerical instability problem in the scikit-learn framework.
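A practical workaround (not part of the original report) is to bundle the scaling step with the estimator in a scikit-learn Pipeline, so every fit and predict call sees features in a well-behaved range. A minimal sketch, reusing the same tiny-feature setup:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
X = iris.data * 1e-8  # reproduce the tiny-feature regime from the issue
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7)

# MinMaxScaler rescales each feature to [0, 1] before the forest sees it;
# the scaler is fit on the training fold only, avoiding test-set leakage.
model = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42))
model.fit(X_train, y_train)
print("Accuracy: %.2f" % model.score(X_test, y_test))
```

With scaling inside the pipeline the forest recovers the usual Iris accuracy instead of degenerating to chance-level predictions.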

Versions

Windows-7-6.1.7601-SP1
Python 2.7.13 |Anaconda custom (64-bit)| (default, May 11 2017, 13:17:26) [MSC v.1500 64 bit (AMD64)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.19.0

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

glemaitre commented, Dec 5, 2017 (1 reaction)

This is to skip unnecessary computation of the impurity, which is the most costly part when trying to find a split.
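The cutoff glemaitre refers to lives in the tree splitter: features whose value range within a node falls below a small compile-time constant are treated as constant and skipped. In scikit-learn's source that constant is named FEATURE_THRESHOLD and is on the order of 1e-7; the exact name and value are an implementation detail and are an assumption here. A small sketch showing why the 1e-8-scaled Iris features all fall under that cutoff:

```python
import numpy as np
from sklearn import datasets

# Assumed cutoff, mirroring FEATURE_THRESHOLD in sklearn/tree/_splitter.pyx
# at the time of this issue; treat the exact value as an implementation detail.
FEATURE_THRESHOLD = 1e-7

X = datasets.load_iris().data * 1e-8
ranges = X.max(axis=0) - X.min(axis=0)  # per-feature spread after scaling down
print(ranges)
print((ranges <= FEATURE_THRESHOLD).all())  # True: every feature looks "constant"
```

Since every feature's spread is below the threshold, the splitter never evaluates a split, every importance comes out 0, and the forest degenerates to majority-style guessing.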
jnothman commented, Dec 5, 2017 (1 reaction)

I don’t think we have enough general precautions about numerical precision issues, but I’m not sure where to put it either. Certainly it should be mentioned in the scaler documentation, and I don’t think it is: doc/modules/preprocessing.rst. PR welcome.

