LinearRegression Exception: "AttributeError: 'tuple' object has no attribute 'shape'"
What happened:
In order to use dask-ml models to train on a dask DataFrame, the DataFrame must be converted to a dask array.
from dask_ml.linear_model import LinearRegression

# df is a dask DataFrame that looks something like this:
#
#                 flower.petal_length  flower.petal_width  flower.sepal_length
# npartitions=73
#                             float64             float64              float64
#                                 ...                 ...                  ...
# ...                             ...                 ...                  ...
#                                 ...                 ...                  ...
#                                 ...                 ...                  ...
# Dask Name: astype, 1185 tasks
train_array = df.to_dask_array(lengths=True)
train_labels_array = df.to_dask_array(lengths=True)
# If I simply do a train_array.compute() or train_labels_array.compute() here & comment out .fit(), there are no problems
dask_model = LinearRegression()
dask_model.fit(train_array, train_labels_array)
When calling .fit on LinearRegression or LogisticRegression, I’m receiving the following output from the dask cluster:
distributed.worker - WARNING - Compute Failed
Function: subgraph_callable
args: (('rechunk-merge-ff32808be3096a08028f7c8aa2a4bae3', 66, 0))
kwargs: {}
Exception: AttributeError("'tuple' object has no attribute 'shape'",)
Followed by the following exception being thrown:
File "my_model.py", line 89, in model
dask_model.fit(train_array, train_labels_array)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask_ml/linear_model/glm.py", line 187, in fit
self._coef = algorithms._solvers[self.solver](X, y, **solver_kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask_glm/utils.py", line 17, in normalize_inputs
mean, std = da.compute(X.mean(axis=0), X.std(axis=0))
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/base.py", line 567, in compute
results = schedule(dsk, keys, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/distributed/client.py", line 2676, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/distributed/client.py", line 1991, in gather
asynchronous=asynchronous,
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/distributed/client.py", line 832, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/distributed/utils.py", line 340, in sync
raise exc.with_traceback(tb)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/distributed/utils.py", line 324, in f
result[0] = yield future
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/distributed/client.py", line 1850, in _gather
raise exception.with_traceback(traceback)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/optimization.py", line 963, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/core.py", line 151, in get
result = _execute_task(task, cache)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/utils.py", line 30, in apply
return func(*args, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/array/reductions.py", line 594, in mean_chunk
n = numel(x, dtype=dtype, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/dask/array/reductions.py", line 558, in numel
shape = x.shape
AttributeError: 'tuple' object has no attribute 'shape'
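For what it's worth, the failing helper just reads the chunk's .shape, so the error means a plain tuple reached the reduction step where a NumPy array chunk was expected. A tiny illustration of the mechanism (not dask's actual code):

import numpy as np

def numel_like(x):
    # mirrors the failing line above: assumes x is array-like with a .shape attribute
    return np.prod(x.shape)

numel_like(np.ones((3, 2)))  # fine: returns 6.0
numel_like((3, 2))           # AttributeError: 'tuple' object has no attribute 'shape'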
What you expected to happen:
Minimal Complete Verifiable Example:
Unfortunately, I'm unable to come up with a minimal example that reproduces this behavior. The DataFrame above goes through several steps before it is converted to an array: it is published in a cluster, has its index reset, and potentially undergoes several transformations via task graphs and delayed(func) / .compute() calls. The basic example above illustrates what's happening, but when I run the same thing on a dask LocalCluster, separate from the other environment, I don't see the issue.
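For reference, a minimal sketch of the kind of pipeline described above, run on a LocalCluster with synthetic data (the column names, sizes, partition count, and astype step are assumptions based on the repr shown earlier):

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
from dask_ml.linear_model import LinearRegression

client = Client(LocalCluster(n_workers=4))

# Synthetic stand-in for the DataFrame shown above
pdf = pd.DataFrame(
    np.random.rand(10_000, 3),
    columns=["flower.petal_length", "flower.petal_width", "flower.sepal_length"],
)
df = dd.from_pandas(pdf, npartitions=73).reset_index(drop=True).astype("float64")

X = df[["flower.petal_length", "flower.petal_width"]].to_dask_array(lengths=True)
y = df["flower.sepal_length"].to_dask_array(lengths=True)

LinearRegression().fit(X, y)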
Environment:
- Dask version: 2.10.1
- Dask-ml version: 1.7.0
- dask-glm version: 0.2.0
- sklearn version: 0.23.2
- Python version: 3.6.8
- Operating System: CentOS 7
- Install method (conda, pip, source): pip
Top GitHub Comments
Hi @Jreyno40,
I had a similar error with a very basic example: https://examples.dask.org/dataframes/01-data-access.html, specifically at two lines in that notebook.
I end up with a stack trace similar to yours, but the cause is a bit different: "AttributeError: 'tuple' object has no attribute 'head'".
Anyway, I solved it by specifying a scheduler option to the compute method and placing it in between the two lines above.
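A sketch of that workaround, assuming the two lines in question were the CSV-reading and head() steps from that notebook (a reconstruction with a placeholder path, not the commenter's exact code):

import dask.dataframe as dd

df = dd.read_csv("data/*.csv")          # first of the two lines (assumed)
df = df.compute(scheduler="threads")    # workaround: threaded compute placed in between
df.head()                               # second of the two lines (assumed)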
I am only starting to learn how to use Dask, but I guess this problem might be similar to yours, and the culprit could be the scheduler backend (threads vs. processes) and how data serialization works under the hood.
If you try the same example from the link above but remove the client option altogether, or change processes=False to processes=True, then you don't need the df = df.compute(scheduler='threads') line at all and the example works just fine.
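For completeness, the client option being varied is presumably along these lines (again an assumption based on that notebook):

from dask.distributed import Client

# With processes=False, the compute(scheduler="threads") workaround was needed;
# with processes=True, or with no Client at all, the example reportedly runs as-is.
client = Client(processes=False)  # change to processes=True, or omit the Client entirely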
Again, maybe I made some naive mistake as a new Dask user, but I hope this helps you fix the error you encountered.
As an update:
I tried upgrading to dask==2022.2.0 and dask-ml==2022.5.27 on the client and workers and re-running my code. I no longer receive the attribute error mentioned above, but now I receive a different error when fitting the logistic regression: "ValueError: shapes (0,) and (739,) not aligned: 0 (dim 0) != 739 (dim 0)". I have printed out the shapes of both arrays before fitting; their shapes do align and neither of them has zero dimensions.
It seems like there might be some sort of communication issue with the dask workers or something else that I do not understand. This code also works perfectly fine with less than 5 partitions of data.
I will try and open up a new issue for this error since it sounds like it is different but could still be related.
Here is the stack trace for this error if it is helpful: