DaskML vs. Sklearn LogisticRegression. Coefficients seem different.
The coefficients from dask_ml.linear_model.LogisticRegression seem different from those of sklearn.linear_model.LogisticRegression. The script below demonstrates a benchmark that shows this.
# %pip install dask_ml memo
import time
import numpy as np
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg
from memo import grid, memlist, memfunc

data_hindsight = []

@memfunc(print)
@memlist(data=data_hindsight)
def run_experiment(size=1_000, chunksize=1_000):
    X, y = make_classification(n_samples=size, n_features=20,
                               n_informative=10, chunksize=chunksize)

    # Run everything in Dask
    tic = time.time()
    mod_dask = DaskLogReg()
    mod_dask.fit(X, y)
    dask_time = time.time() - tic

    # Run everything in Scikit-Learn
    X_np, y_np = np.asarray(X), np.asarray(y)
    mod_sklearn = SklearnLogReg()
    tic = time.time()
    mod_sklearn.fit(X_np, y_np)
    sklearn_time = time.time() - tic

    # Return stats that are of interest
    return {
        'time_dask': dask_time,
        'time_sklearn': sklearn_time,
        'acc_dask': (mod_dask.predict(X) == y).mean().compute(),
        'acc_sklearn': (mod_sklearn.predict(X_np) == y_np).mean(),
        'coef_diff': np.abs(mod_sklearn.coef_ - mod_dask.coef_).sum()  # This is what I'm interested in understanding.
    }

for size in [1000, 2000]:
    for chunksize in [500, 1000]:
        run_experiment(size=size, chunksize=chunksize)
The output in data_hindsight looks like this:
{'size': 1000, 'chunksize': 500, 'time_dask': 1.2568867206573486, 'time_sklearn': 0.003863811492919922, 'acc_dask': 0.795, 'acc_sklearn': 0.795, 'coef_diff': 7.47177509238394}
{'size': 1000, 'chunksize': 1000, 'time_dask': 0.09700465202331543, 'time_sklearn': 0.005501270294189453, 'acc_dask': 0.784, 'acc_sklearn': 0.784, 'coef_diff': 8.645457699298978}
{'size': 2000, 'chunksize': 500, 'time_dask': 2.3598458766937256, 'time_sklearn': 0.007267951965332031, 'acc_dask': 0.757, 'acc_sklearn': 0.7605, 'coef_diff': 5.962742662561093}
{'size': 2000, 'chunksize': 1000, 'time_dask': 1.5501070022583008, 'time_sklearn': 0.010750293731689453, 'acc_dask': 0.7985, 'acc_sklearn': 0.7995, 'coef_diff': 8.450044499024445}
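One way to probe results like the ones above is to evaluate the regularized objective directly for any coefficient vector: whichever fit achieves the lower objective converged closer to the true optimum. The helper below is a hypothetical sketch (not part of either library), written against scikit-learn's C convention (loss summed over samples, penalty 0.5/C * ||w||^2, intercept unpenalized) and demonstrated on a scikit-learn fit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def l2_logistic_objective(coef, intercept, X, y, C=1.0):
    """L2-regularized logistic loss in scikit-learn's C convention.

    Hypothetical helper: pass any (coef, intercept) pair -- e.g. from the
    dask-ml fit and the sklearn fit -- and compare the returned values.
    """
    z = X @ np.ravel(coef) + intercept
    ys = 2 * y - 1                             # map labels {0, 1} -> {-1, +1}
    loss = np.logaddexp(0.0, -ys * z).sum()    # stable log(1 + exp(-y*z))
    penalty = 0.5 / C * np.dot(np.ravel(coef), np.ravel(coef))
    return loss + penalty

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)
mod = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

obj = l2_logistic_objective(mod.coef_, mod.intercept_[0], X, y)
# Perturbing the fitted coefficients should only increase the objective,
# since the fit (approximately) minimizes it.
perturbed = l2_logistic_objective(mod.coef_ + 0.5, mod.intercept_[0], X, y)
print(obj, perturbed)
```

Comparing the objective values of the two coefficient vectors would show whether the dask-ml solver simply stopped early (higher objective) or is solving a differently-parameterized problem.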
Although the accuracy on the train set seems to be roughly the same, I cannot help but wonder why the coef_diff is so large. I understand that the internal optimizer is bound to be different in a parallel setting, so some difference is expected, but the difference seems much larger than my gut would suggest.
Environment:
Python implementation: CPython
Python version : 3.7.9
IPython version : 7.27.0
Compiler : GCC 9.3.0
OS : Linux
Release : 5.11.0-7614-generic
Machine : x86_64
Processor : x86_64
CPU cores : 12
Architecture: 64bit
dask==2021.9.1
dask-glm==0.2.0
dask-ml==1.9.0
scikit-learn==0.24.2
Issue Analytics
- Created 2 years ago
- Comments: 10 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I’d love to see a PR for that! I think it’d be a valuable PR, mostly because I think Dask-ML should have a test showing it converges to Scikit-learn’s solution (and to be frank, I’m surprised that test is missing!).
I am facing a similar problem. Are you trying it on multiple classes? It seems like dask-ml’s LogisticRegression doesn’t work correctly on multi-class classification.
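A quick way to check whether you are in the multi-class case mentioned above is to inspect coef_.shape: scikit-learn returns one row of coefficients per class once n_classes > 2, whereas the benchmark in this issue uses make_classification's default of two classes. A sketch (sklearn only; dataset parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Binary problem (make_classification's default, as in the benchmark above).
X2, y2 = make_classification(n_samples=500, n_features=20,
                             n_informative=10, n_classes=2, random_state=0)
# Three-class problem -- the case the comment above warns about.
X3, y3 = make_classification(n_samples=500, n_features=20,
                             n_informative=10, n_classes=3, random_state=0)

binary = LogisticRegression(max_iter=1000).fit(X2, y2)
multi = LogisticRegression(max_iter=1000).fit(X3, y3)

print(binary.coef_.shape)  # single coefficient vector
print(multi.coef_.shape)   # one row per class
```

If the labels are multi-class, comparing a single dask-ml coefficient vector against scikit-learn's per-class rows would make coef_diff meaningless on top of any solver differences.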