dask_ml.decomposition.PCA: ValueError with data > 1 TB

When I do dask_ml.decomposition.PCA().fit(x), where the array x has a size > 1 TB, I get the error ValueError: output array is read-only.

I use

dask-ml                   1.1.1
distributed               2.9.0

The script

from dask_jobqueue import SLURMCluster
from dask.distributed import Client
from dask_ml.decomposition import PCA
import dask.array as da

cluster = SLURMCluster()
nb_workers = 58
cluster.scale(nb_workers)
client = Client(cluster)
client.wait_for_workers(nb_workers)

x = da.random.random((1000000, 140000), chunks=(100000, 2000))
pca = PCA(n_components=64)
pca.fit(x)

gives the error

Traceback (most recent call last):
  File "value_error.py", line 48, in <module>
    pca.fit(x)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 190, in fit
    self._fit(X)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 338, in _fit
    raise e
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 325, in _fit
    singular_values,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 2573, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1873, in gather
    asynchronous=asynchronous,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 768, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
    raise exc.with_traceback(tb)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
    result[0] = yield future
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1729, in _gather
    raise exception.with_traceback(traceback)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/sklearn/utils/extmath.py", line 516, in svd_flip
    v *= signs[:, np.newaxis]
ValueError: output array is read-only
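
As a quick sanity check on the sizes involved, the footprint of the array can be estimated without a cluster. This is a minimal sketch that assumes da.random.random produces float64 values (8 bytes each); array_nbytes is a hypothetical helper name, not a Dask function.

```python
import numpy as np


def array_nbytes(shape, dtype=np.float64):
    """Return the size in bytes of a dense array with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * np.dtype(dtype).itemsize


# Shape and chunking taken from the script above.
total_bytes = array_nbytes((1_000_000, 140_000))
chunk_bytes = array_nbytes((100_000, 2_000))
n_chunks = (1_000_000 // 100_000) * (140_000 // 2_000)

print(f"total: {total_bytes / 1e12:.2f} TB")
print(f"per chunk: {chunk_bytes / 1e9:.2f} GB in {n_chunks} chunks")
```

Under that assumption the full array is about 1.12 TB, split into 700 chunks of about 1.6 GB each.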

Note that

  • If I use x = da.random.random((1000000, 130000), chunks=(100000, 2000)) (1.0 TB), the error does not appear.
  • When I look at the dashboard, the PCA seems to run fine and the error appears at the very end of the computation.
  • I temporarily fixed the error in extmath.py by changing svd_flip as follows:
def svd_flip(u, v, u_based_decision=True):
    if u_based_decision:
        # columns of u, rows of v
        max_abs_cols = np.argmax(np.abs(u), axis=0)
        signs = np.sign(u[max_abs_cols, range(u.shape[1])])
        u *= signs
-        v *= signs[:, np.newaxis]
+        v_copy = np.copy(v)
+        v_copy *= signs[:, np.newaxis]
+        return u, v_copy
    else:

I think this is not a good fix because I assume that the array v is blocked by another function. Is there another way to fix the error?
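
One possible alternative to patching scikit-learn in place is a sign flip that never writes to its inputs. The following is only a sketch: svd_flip_copy is a hypothetical helper, not part of scikit-learn or dask-ml, and it covers only the u-based branch shown in the diff above. Note that the patch above still runs u *= signs in place, so it would presumably fail in the same way if u were read-only too.

```python
import numpy as np


def svd_flip_copy(u, v):
    """Hypothetical helper: u-based sign flip that returns new arrays.

    Neither input is modified, so read-only arrays are acceptable.
    """
    # Row index of the largest-magnitude entry in each column of u.
    max_abs_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_rows, np.arange(u.shape[1])])
    # Multiplication allocates fresh outputs instead of writing into u or v.
    return u * signs, v * signs[:, np.newaxis]


# Small local check with read-only inputs.
rng = np.random.default_rng(0)
a = rng.standard_normal((6, 3))
u, s, vt = np.linalg.svd(a, full_matrices=False)
u.setflags(write=False)
vt.setflags(write=False)

u_flipped, vt_flipped = svd_flip_copy(u, vt)
reconstruction = (u_flipped * s) @ vt_flipped
print(np.allclose(reconstruction, a))
```

The cost is one extra allocation for each of u and v. That is likely small here, since both have only n_components columns or rows, but it has not been measured on the cluster in question.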

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

TomAugspurger commented, Dec 12, 2019

> Can you reproduce the error on your HPC?

I don’t have access to an HPC machine.

https://github.com/dask/distributed/issues/1978 does sound related. Does ensuring that all your dependencies are built against Cython 0.28 or newer fix things?

https://github.com/dask/distributed/issues/1978#issuecomment-448209604 is using PCA as well. Let’s continue the discussion over there.

jakirkham commented, Dec 12, 2019

> I also don’t understand why the array is readonly.

Because we send bytes over the wire and bytes are immutable.

In [1]: memoryview(b"abc").readonly
Out[1]: True
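
The same effect can be reproduced locally with NumPy alone. This is a minimal illustration of the mechanism, not the actual deserialization path in distributed: it assumes only that an array is rebuilt on top of an immutable bytes object, here with np.frombuffer.

```python
import numpy as np

# Pretend these bytes arrived over the network.
payload = np.arange(6, dtype=np.float64).tobytes()

# An array that wraps a bytes object shares its memory and is read-only.
v = np.frombuffer(payload, dtype=np.float64).reshape(2, 3)
signs = np.array([1.0, -1.0])
print(v.flags.writeable)

try:
    v *= signs[:, np.newaxis]  # in-place update, as in svd_flip
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# An out-of-place multiply allocates a new, writable array instead.
flipped = v * signs[:, np.newaxis]
print(flipped.flags.writeable)
```

The in-place multiply raises the ValueError from the traceback, while the out-of-place version succeeds because it never writes into the received buffer.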