DaskML vs. Sklearn LogisticRegression. Coefficients seem different.
The coefficients from dask_ml.linear_model.LogisticRegression seem different from those of sklearn.linear_model.LogisticRegression. The script below demonstrates a benchmark that shows this.
# %pip install dask_ml memo
import time
import numpy as np
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg
from memo import grid, memlist, memfunc

data_hindsight = []

@memfunc(print)
@memlist(data=data_hindsight)
def run_experiment(size=1_000, chunksize=1_000):
    X, y = make_classification(n_samples=size, n_features=20,
                               n_informative=10, chunksize=chunksize)

    # Run everything in Dask
    tic = time.time()
    mod_dask = DaskLogReg()
    mod_dask.fit(X, y)
    dask_time = time.time() - tic

    # Run everything in Scikit-Learn
    X_np, y_np = np.asarray(X), np.asarray(y)
    mod_sklearn = SklearnLogReg()
    tic = time.time()
    mod_sklearn.fit(X_np, y_np)
    sklearn_time = time.time() - tic

    # Return stats that are of interest
    return {
        'time_dask': dask_time,
        'time_sklearn': sklearn_time,
        'acc_dask': (mod_dask.predict(X) == y).mean().compute(),
        'acc_sklearn': (mod_sklearn.predict(X_np) == y_np).mean(),
        'coef_diff': np.abs(mod_sklearn.coef_ - mod_dask.coef_).sum()  # This is what I'm interested in understanding.
    }

for size in [1000, 2000]:
    for chunksize in [500, 1000]:
        run_experiment(size=size, chunksize=chunksize)
The output in data_hindsight looks like this:
{'size': 1000, 'chunksize': 500, 'time_dask': 1.2568867206573486, 'time_sklearn': 0.003863811492919922, 'acc_dask': 0.795, 'acc_sklearn': 0.795, 'coef_diff': 7.47177509238394}
{'size': 1000, 'chunksize': 1000, 'time_dask': 0.09700465202331543, 'time_sklearn': 0.005501270294189453, 'acc_dask': 0.784, 'acc_sklearn': 0.784, 'coef_diff': 8.645457699298978}
{'size': 2000, 'chunksize': 500, 'time_dask': 2.3598458766937256, 'time_sklearn': 0.007267951965332031, 'acc_dask': 0.757, 'acc_sklearn': 0.7605, 'coef_diff': 5.962742662561093}
{'size': 2000, 'chunksize': 1000, 'time_dask': 1.5501070022583008, 'time_sklearn': 0.010750293731689453, 'acc_dask': 0.7985, 'acc_sklearn': 0.7995, 'coef_diff': 8.450044499024445}
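One way to probe results like the ones above is to evaluate the regularized objective directly for any coefficient vector: whichever fit achieves the lower objective converged closer to the true optimum. The helper below is a hypothetical sketch (not part of either library), written against scikit-learn's C convention (loss summed over samples, penalty 0.5/C * ||w||^2, intercept unpenalized) and demonstrated on a scikit-learn fit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def l2_logistic_objective(coef, intercept, X, y, C=1.0):
    """L2-regularized logistic loss in scikit-learn's C convention.

    Hypothetical helper: pass any (coef, intercept) pair -- e.g. from the
    dask-ml fit and the sklearn fit -- and compare the returned values.
    """
    z = X @ np.ravel(coef) + intercept
    ys = 2 * y - 1                             # map labels {0, 1} -> {-1, +1}
    loss = np.logaddexp(0.0, -ys * z).sum()    # stable log(1 + exp(-y*z))
    penalty = 0.5 / C * np.dot(np.ravel(coef), np.ravel(coef))
    return loss + penalty

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)
mod = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

obj = l2_logistic_objective(mod.coef_, mod.intercept_[0], X, y)
# Perturbing the fitted coefficients should only increase the objective,
# since the fit (approximately) minimizes it.
perturbed = l2_logistic_objective(mod.coef_ + 0.5, mod.intercept_[0], X, y)
print(obj, perturbed)
```

Comparing the objective values of the two coefficient vectors would show whether the dask-ml solver simply stopped early (higher objective) or is solving a differently-parameterized problem.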
Although the accuracy on the train set seems to be roughly the same, I cannot help but wonder why the coef_diff is so large. I understand that the internal optimizer is bound to be different in a parallel setting, so some difference is expected, but the difference seems much larger than my gut would suggest.
Environment:
Python implementation: CPython
Python version : 3.7.9
IPython version : 7.27.0
Compiler : GCC 9.3.0
OS : Linux
Release : 5.11.0-7614-generic
Machine : x86_64
Processor : x86_64
CPU cores : 12
Architecture: 64bit
dask==2021.9.1
dask-glm==0.2.0
dask-ml==1.9.0
scikit-learn==0.24.2
Issue Analytics
- Created 2 years ago
- Comments: 10 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I’d love to see a PR for that! I think it’d be a valuable PR, mostly because I think Dask-ML should have a test showing it converges to Scikit-learn’s solution (and to be frank, I’m surprised that test is missing!).
I am facing a similar problem. Are you trying it on multiple classes? It seems like dask-ml’s LogisticRegression doesn’t work correctly on multi-class classification.
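A quick way to check whether you are in the multi-class case mentioned above is to inspect coef_.shape: scikit-learn returns one row of coefficients per class once n_classes > 2, whereas the benchmark in this issue uses make_classification's default of two classes. A sketch (sklearn only; dataset parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Binary problem (make_classification's default, as in the benchmark above).
X2, y2 = make_classification(n_samples=500, n_features=20,
                             n_informative=10, n_classes=2, random_state=0)
# Three-class problem -- the case the comment above warns about.
X3, y3 = make_classification(n_samples=500, n_features=20,
                             n_informative=10, n_classes=3, random_state=0)

binary = LogisticRegression(max_iter=1000).fit(X2, y2)
multi = LogisticRegression(max_iter=1000).fit(X3, y3)

print(binary.coef_.shape)  # single coefficient vector
print(multi.coef_.shape)   # one row per class
```

If the labels are multi-class, comparing a single dask-ml coefficient vector against scikit-learn's per-class rows would make coef_diff meaningless on top of any solver differences.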