discrete branch: add a compelling example of discretization's benefits

We recently merged a discretizing transformer into the discrete branch (see diff between that branch and master). Before merging it into master, we’d like a compelling example for our example gallery showing an application of machine learning where discretized features are particularly useful.

Note to contributors: make sure to submit your pull request to the discrete branch.

Issue Analytics

State:
Created 6 years ago
Comments: 20 (19 by maintainers)

Top GitHub Comments

qinhanmin2014 commented, Sep 6, 2017

@jnothman Forgive me if the example is not good, since I'm not an expert in machine learning 😃 The scores are averaged over folds.

DecisionTree score before discretization : 0.946666666667
DecisionTree score std before discretization : 0.04
DecisionTree score after discretization : 0.96
DecisionTree score std after discretization : 0.0326598632371
SVC score before discretization : 0.96
SVC score std before discretization : 0.0249443825785
SVC score after discretization : 0.966666666667
SVC score std after discretization : 0.0249443825785

Since our discretization is naive, we cannot expect a big improvement. The experiment is based mainly on this paper (more than 2,000 citations) and other materials. Here is part of the main code:

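import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
# Keep only columns 2 and 3 (petal length and petal width)
X = X[:, [2, 3]]
# Bin each feature into 10 ordinal-coded bins
Xt = KBinsDiscretizer(n_bins=10, encode='ordinal').fit_transform(X)
clf1 = DecisionTreeClassifier(random_state=0)
scores1 = cross_val_score(clf1, X, y, cv=5)  # raw features
print("DecisionTree score before discretization : {}"
      .format(np.mean(scores1)))
print("DecisionTree score std before discretization : {}"
      .format(np.std(scores1)))
clf2 = DecisionTreeClassifier(random_state=0)
scores2 = cross_val_score(clf2, Xt, y, cv=5)  # discretized features
print("DecisionTree score after discretization : {}"
      .format(np.mean(scores2)))
print("DecisionTree score std after discretization : {}"
      .format(np.std(scores2)))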
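The snippet above fits the discretizer on the full dataset before cross-validating. A possible variant, sketched here under assumptions, wraps the discretizer and the classifier in a pipeline so the bin edges are learned from each training fold only, and runs the same comparison for both classifiers. The `compare` helper, the default `SVC()` settings, and `strategy="uniform"` are assumptions for illustration, not taken from the original experiment, so the scores may differ from those reported.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, [2, 3]]  # petal length and petal width, as in the snippet above


def compare(estimator, n_bins=10):
    """Hypothetical helper: 5-fold CV scores on raw and on binned features."""
    raw = cross_val_score(estimator, X, y, cv=5)
    # Inside cross_val_score the pipeline is refit per fold, so the
    # discretizer only ever sees the training part of each split.
    binned_model = make_pipeline(
        KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform"),
        estimator,
    )
    binned = cross_val_score(binned_model, X, y, cv=5)
    return raw, binned


results = {
    "DecisionTree": compare(DecisionTreeClassifier(random_state=0)),
    "SVC": compare(SVC()),  # default SVC settings are an assumption
}
for name, (raw, binned) in results.items():
    print("{} before: {:.3f} +/- {:.3f}".format(name, raw.mean(), raw.std()))
    print("{} after : {:.3f} +/- {:.3f}".format(name, binned.mean(), binned.std()))
```

With only 150 samples and five folds, differences of one or two hundredths between the "before" and "after" scores are well within the fold-to-fold spread, so a comparison like this is better read as a sanity check than as evidence of a benefit.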

qinhanmin2014 commented, Sep 6, 2017

@jnothman (Sorry for the repeated updates.) Here is my plan for the example, please have a look. Thanks 😃

Dataset: iris (using only two features)
(1) Plot the data before and after discretization.
(2) Train a classifier on the data before and after discretization and compare the results.

DecisionTree score before discretization : 0.946666666667
DecisionTree score after discretization : 0.96
SVC score before discretization : 0.96
SVC score after discretization : 0.966666666667
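To make the effect of discretization in the plan above concrete, here is a minimal pure-Python sketch of equal-width binning with ordinal codes. It assumes that "naive" discretization means unsupervised equal-width bins; the `equal_width_bins` function and the sample values are hypothetical and are not the branch's implementation.

```python
def equal_width_bins(values, n_bins):
    """Map each value to an ordinal bin index in 0..n_bins-1 (illustrative only)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature collapses into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    codes = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value would land at index n_bins, so clip it
        # into the last bin.
        codes.append(min(idx, n_bins - 1))
    return codes


# Made-up feature values, not taken from the iris dataset.
sample = [1.0, 1.5, 3.0, 4.5, 5.0, 6.0, 7.0]
codes = equal_width_bins(sample, n_bins=3)
print(codes)  # [0, 0, 1, 1, 2, 2, 2]
```

Plotting the original values next to these codes would show the continuous feature collapsing onto a small grid, which is the picture step (1) of the plan aims for.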
