
DaskML vs. Sklearn LogisticRegression. Coefficients seem different.

See original GitHub issue

The coefficients from dask_ml.linear_model.LogisticRegression seem to differ from those of sklearn.linear_model.LogisticRegression.

The benchmark script below demonstrates this.

# %pip install dask_ml memo
import time
import numpy as np
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg
from memo import grid, memlist, memfunc

data_hindsight = []

@memfunc(print)
@memlist(data=data_hindsight)
def run_experiment(size=1_000, chunksize=1_000):
    X, y = make_classification(n_samples=size, n_features=20, n_informative=10, chunksize=chunksize)

    # Run everything in Dask
    tic = time.time()
    mod_dask = DaskLogReg()
    mod_dask.fit(X, y)
    dask_time = time.time() - tic 

    # Run everything in Scikit-Learn
    X_np, y_np = np.asarray(X), np.asarray(y)
    mod_sklearn = SklearnLogReg()
    tic = time.time()
    mod_sklearn.fit(X_np, y_np)
    sklearn_time = time.time() - tic
    
    # Return stats that are of interest
    return {
        'time_dask': dask_time,
        'time_sklearn': sklearn_time, 
        'acc_dask': (mod_dask.predict(X) == y).mean().compute(),
        'acc_sklearn': (mod_sklearn.predict(X_np) == y_np).mean(),
        'coef_diff': np.abs(mod_sklearn.coef_ - mod_dask.coef_).sum() # This is what I'm interested in understanding.
    }

for size in [1000, 2000]:
    for chunksize in [500, 1000]:
        run_experiment(size=size, chunksize=chunksize)

The output in data_hindsight looks like this:

{'size': 1000, 'chunksize': 500, 'time_dask': 1.2568867206573486, 'time_sklearn': 0.003863811492919922, 'acc_dask': 0.795, 'acc_sklearn': 0.795, 'coef_diff': 7.47177509238394}
{'size': 1000, 'chunksize': 1000, 'time_dask': 0.09700465202331543, 'time_sklearn': 0.005501270294189453, 'acc_dask': 0.784, 'acc_sklearn': 0.784, 'coef_diff': 8.645457699298978}
{'size': 2000, 'chunksize': 500, 'time_dask': 2.3598458766937256, 'time_sklearn': 0.007267951965332031, 'acc_dask': 0.757, 'acc_sklearn': 0.7605, 'coef_diff': 5.962742662561093}
{'size': 2000, 'chunksize': 1000, 'time_dask': 1.5501070022583008, 'time_sklearn': 0.010750293731689453, 'acc_dask': 0.7985, 'acc_sklearn': 0.7995, 'coef_diff': 8.450044499024445}

Although the accuracy on the train set is roughly the same, I cannot help but wonder why the coef_diff is so large. I understand that the internal optimizer is bound to be different in a parallel setting, so some difference is expected, but it seems too big to my gut feeling.

Environment:

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.27.0

Compiler    : GCC 9.3.0
OS          : Linux
Release     : 5.11.0-7614-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

dask==2021.9.1
dask-glm==0.2.0
dask-ml==1.9.0
scikit-learn==0.24.2

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
stsievert commented, Mar 1, 2022

I’d love to see a PR for that! I think it’d be a valuable PR, mostly because I think Dask-ML should have a test showing it converges to Scikit-learn’s solution (and to be frank I’m surprised that test is missing!).
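Such a convergence test could compare coefficient directions rather than raw magnitudes, since a mismatch in effective regularization strength rescales coefficients without changing which direction they point. A minimal sketch of one possible metric (my suggestion, not code from Dask-ML):

```python
import numpy as np

def coef_cosine(a, b):
    """Cosine similarity between two flattened coefficient arrays.

    1.0 means the coefficients point in exactly the same direction,
    regardless of their overall scale.
    """
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two weight vectors pointing the same way at different scales:
w1 = np.array([1.0, -2.0, 3.0])
w2 = 5.0 * w1
print(coef_cosine(w1, w2))  # ~1.0 despite a large absolute difference
```

A raw L1 difference like the coef_diff above conflates scale with direction; a direction-based check would pass even if the two solvers disagree only on effective penalty strength.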

1 reaction
bingoko commented, Oct 12, 2021

I am facing a similar problem. Are you trying it on multiple classes? It seems like dask-ml’s LogisticRegression doesn’t work correctly on multi-class classification.
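A quick way to check whether an estimator genuinely handles the multi-class case is to inspect its fitted coefficient shape: in scikit-learn, a multi-class fit produces one coefficient row per class. A small illustrative sketch (using scikit-learn only, since the dask-ml behavior is exactly what is in question here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Three-class problem; n_clusters_per_class=1 keeps the class count valid.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X, y)
# A genuine multi-class fit yields coef_ of shape (n_classes, n_features).
print(clf.coef_.shape)
```

If an estimator returns a single coefficient row for three classes, that is a sign it silently fell back to a binary formulation.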

Read more comments on GitHub >

Top Results From Across the Web

Why does SKLearn's Logistic Regression model have the ...
The constant is arbitrary and may alone be a reason for the difference, but there are many other differences if you look at...

logistic regression - Different results from scikit-learn and dask ...
Dask-ml, as of version dask_ml==1.0.0, doesn't support logistic regression with multiple classes. Using a slightly modified version of your ...

Chapter 10: Machine learning with Dask-ML - Data Science ...
In this chapter, we'll have a look at the last major API of Dask: Dask-ML. ... The C coefficient and penalty parameter in...

Custom Machine Learning Estimators at Scale on Dask ...
In banks and other financial institutions, models must go through a ... PyData community. dask-ml is a library of scikit-learn extensions ...

dask_ml.linear_model.LogisticRegression - Dask-ML
Estimator for logistic regression. Parameters: penalty : str or Regularizer, default 'l2'. Regularizer to use. Only relevant for the 'admm', 'lbfgs' and ...
