`predict` fails for multioutput ensemble models with non-numeric DVs

See original GitHub issue

Description

Multioutput forest models assume that the dependent variables (DVs) are numeric. Calling `predict` on a model fitted with string DVs raises the following error:

ValueError: could not convert string to float:

I’m going to take a stab at submitting a fix today, but I wanted to file an issue to document the problem in case I’m not able to finish a fix.
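One plausible way this kind of error arises is a prediction buffer allocated with a float dtype that later receives string class labels. The sketch below is not the scikit-learn source; `encode_outputs` and `decode_outputs` are hypothetical helpers written only to illustrate the mechanism with plain NumPy, and the assumption that the buffer dtype is the culprit is mine.

```python
import numpy as np

# Small multi-output target with string labels (two outputs per sample).
y_train = np.array([["red", "blue"], ["green", "green"],
                    ["red", "purple"], ["green", "yellow"]])


def encode_outputs(y):
    """Hypothetical helper: per-output class labels and integer codes."""
    classes, codes = [], []
    for k in range(y.shape[1]):
        classes_k, codes_k = np.unique(y[:, k], return_inverse=True)
        classes.append(classes_k)
        codes.append(codes_k)
    return classes, codes


def decode_outputs(classes, codes, dtype):
    """Hypothetical helper: map integer codes back to labels in a buffer."""
    out = np.empty((len(codes[0]), len(classes)), dtype=dtype)
    for k in range(len(classes)):
        # Fails when dtype is float and the labels are strings.
        out[:, k] = classes[k].take(codes[k])
    return out


classes, codes = encode_outputs(y_train)

# A float buffer cannot hold string labels.
float_error = None
try:
    decode_outputs(classes, codes, dtype=np.float64)
except ValueError as exc:
    float_error = str(exc)

# A buffer that matches the label dtype round-trips the targets.
decoded = decode_outputs(classes, codes, dtype=y_train.dtype)
```

With the float buffer, the assignment raises a `ValueError` whose message begins with "could not convert string to float"; with a buffer whose dtype matches the labels, `decoded` equals `y_train`.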

Steps/Code to Reproduce

I wrote a test based on ensemble/tests/test_forest:test_multioutput which currently fails:

def check_multioutput_string(name):
    # Check estimators on multi-output problems with string outputs.

    X_train = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1], [-2, 1],
               [-1, 1], [-1, 2], [2, -1], [1, -1], [1, -2]]
    y_train = [["red", "blue"], ["red", "blue"], ["red", "blue"], ["green", "green"],
               ["green", "green"], ["green", "green"], ["red", "purple"],
               ["red", "purple"], ["red", "purple"], ["green", "yellow"],
               ["green", "yellow"], ["green", "yellow"]]
    X_test = [[-1, -1], [1, 1], [-1, 1], [1, -1]]
    y_test = [["red", "blue"], ["green", "green"], ["red", "purple"], ["green", "yellow"]]

    est = FOREST_ESTIMATORS[name](random_state=0, bootstrap=False)
    y_pred = est.fit(X_train, y_train).predict(X_test)
    # String labels need an exact comparison, not a numeric tolerance check.
    assert_array_equal(y_pred, y_test)

    if name in FOREST_CLASSIFIERS:
        with np.errstate(divide="ignore"):
            proba = est.predict_proba(X_test)
            assert_equal(len(proba), 2)
            assert_equal(proba[0].shape, (4, 2))
            assert_equal(proba[1].shape, (4, 4))

            log_proba = est.predict_log_proba(X_test)
            assert_equal(len(log_proba), 2)
            assert_equal(log_proba[0].shape, (4, 2))
            assert_equal(log_proba[1].shape, (4, 4))


@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
# String targets are only meaningful for classifiers, not regressors.
@pytest.mark.parametrize('name', FOREST_CLASSIFIERS)
def test_multioutput_string(name):
    check_multioutput_string(name)

Expected Results

No error is thrown, and `predict` runs for all multioutput ensemble models.

Actual Results

ValueError: could not convert string to float: <DV class>

Versions

I replicated this error using the current master branch of sklearn (0.21.dev0).

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

amueller commented, Dec 19, 2018 (1 reaction)

Feel free to submit a PR if you like

jnothman commented, Dec 20, 2018 (0 reactions)

I suspect this case is handled just fine in KNNClassifier and DecisionTreeClassifier, so we should probably handle it here… and add common tests.
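A rough sketch of what such a common check could look like follows. It assumes a scikit-learn release in which this issue is resolved, reads "KNNClassifier" as `KNeighborsClassifier`, and uses a `results` dictionary that is purely illustrative; it is not the test that was added to the library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Same toy data as the reproduction: four clusters, two string outputs.
X_train = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1],
           [-2, 1], [-1, 1], [-1, 2], [2, -1], [1, -1], [1, -2]]
y_train = np.array(
    [["red", "blue"]] * 3 + [["green", "green"]] * 3
    + [["red", "purple"]] * 3 + [["green", "yellow"]] * 3
)
X_test = [[-1, -1], [1, 1], [-1, 1], [1, -1]]
y_test = np.array([["red", "blue"], ["green", "green"],
                   ["red", "purple"], ["green", "yellow"]])

# Estimators named in the comment above, plus one forest classifier.
estimators = {
    "knn": KNeighborsClassifier(n_neighbors=1),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=10, random_state=0,
                                     bootstrap=False),
}

# Record, per estimator, whether string predictions match exactly.
results = {}
for name, est in estimators.items():
    y_pred = est.fit(X_train, y_train).predict(X_test)
    results[name] = bool(np.array_equal(y_pred, y_test))
```

Every test point is also a training point, so each estimator is expected to recover the string labels exactly when multioutput string targets are supported.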
