dask_ml.decomposition.PCA: ValueError with data > 1 TB
When I do dask_ml.decomposition.PCA().fit(x), where the array x has a size > 1 TB, I get the error ValueError: output array is read-only.
I use
dask-ml 1.1.1
distributed 2.9.0
The script
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
from dask_ml.decomposition import PCA
import dask.array as da
cluster = SLURMCluster()
nb_workers = 58
cluster.scale(nb_workers)
client = Client(cluster)
client.wait_for_workers(nb_workers)
x = da.random.random((1000000, 140000), chunks=(100000, 2000))
pca = PCA(n_components=64)
pca.fit(x)
gives the error
Traceback (most recent call last):
  File "value_error.py", line 48, in <module>
    pca.fit(x)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 190, in fit
    self._fit(X)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 338, in _fit
    raise e
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 325, in _fit
    singular_values,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 2573, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1873, in gather
    asynchronous=asynchronous,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 768, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
    raise exc.with_traceback(tb)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
    result[0] = yield future
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1729, in _gather
    raise exception.with_traceback(traceback)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/sklearn/utils/extmath.py", line 516, in svd_flip
    v *= signs[:, np.newaxis]
ValueError: output array is read-only
Note that
- If I use x = da.random.random((1000000, 130000), chunks=(100000, 2000)) (1.0 TB), the error does not appear.
- When I look at the dashboard, the PCA seems to run fine and the error appears at the very end of the computation.
- I temporarily fixed the error in extmath.py by changing
 def svd_flip(u, v, u_based_decision=True):
     if u_based_decision:
         # columns of u, rows of v
         max_abs_cols = np.argmax(np.abs(u), axis=0)
         signs = np.sign(u[max_abs_cols, range(u.shape[1])])
         u *= signs
-        v *= signs[:, np.newaxis]
+        v_copy = np.copy(v)
+        v_copy *= signs[:, np.newaxis]
+        return u, v_copy
     else:
I think this is not a good fix because I assume that the array v is blocked by another function.
Is there another way to fix the error?
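For illustration, here is a minimal sketch of a variant of the workaround above that avoids in-place writes altogether. The name svd_flip_no_inplace is hypothetical and is not part of scikit-learn; the sketch only assumes that u and v may arrive as read-only NumPy arrays and covers the u-based decision branch.

```python
import numpy as np


def svd_flip_no_inplace(u, v):
    """Hypothetical helper: u-based sign flip that never writes into u or v."""
    # For each column of u, find the row holding the largest absolute value.
    max_abs_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_rows, np.arange(u.shape[1])])
    # Plain multiplication allocates new arrays, so read-only inputs are fine.
    return u * signs, v * signs[:, np.newaxis]


# Simulate arrays that arrive read-only (an assumption made for this demo).
u = np.array([[-3.0, 1.0], [1.0, 2.0]])
v = np.array([[1.0, -1.0], [2.0, 2.0]])
u.setflags(write=False)
v.setflags(write=False)

u_flipped, v_flipped = svd_flip_no_inplace(u, v)
```

The cost of this approach is one extra allocation for each of u and v, which may matter for very large factors.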
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (6 by maintainers)
Top GitHub Comments
I don’t have access to an HPC machine.
https://github.com/dask/distributed/issues/1978 does sound related. Does ensuring that all your dependencies are built against Cython 0.28 or newer fix things?
https://github.com/dask/distributed/issues/1978#issuecomment-448209604 is using PCA as well. Let’s continue the discussion over there.
Because we send bytes over the wire and bytes are immutable.
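The effect described in that last comment can be reproduced with NumPy alone. The sketch below is not distributed's actual deserialization code; it only assumes that an array is rebuilt as a zero-copy view over an immutable bytes object, which is enough to trigger the same ValueError on an in-place operation.

```python
import numpy as np

# Serialize an array to an immutable bytes object, standing in for network transfer.
original = np.arange(4, dtype=np.float64)
payload = original.tobytes()

# frombuffer wraps the bytes without copying, so the result cannot be written to.
received = np.frombuffer(payload, dtype=np.float64)
is_writeable = received.flags.writeable

try:
    received *= 2.0  # in-place update on a read-only array
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# An explicit copy owns its memory and is writeable again.
writable = received.copy()
writable *= 2.0
```

Here is_writeable is False and the in-place multiplication raises a ValueError, while the copied array can be modified freely.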