Tackle "ValueError: buffer source array is read-only"

One of the biggest pitfalls of running dask.array with a distributed scheduler is the dreaded ValueError: buffer source array is read-only. This error typically occurs when a memoryview-based Cython kernel runs on distributed, and it’s particularly insidious because it will never show up in unit tests performed with the local dask multithreaded scheduler. It might not even show up when you run your workload on distributed and the arrays just happen never to transit between nodes or to the disk cache (which is exactly what the scheduler will try to achieve if enough RAM and CPU power are available).

In other words, this is a textbook example of an issue that risks appearing for the first time in production!
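
To see where the read-only arrays come from, you can round-trip an array through distributed's serialization machinery by hand. This is only a minimal sketch, assuming distributed.protocol's serialize/deserialize helpers; whether the round-tripped array comes back writeable depends on the versions involved:

import numpy as np
from distributed.protocol import serialize, deserialize

# Round-trip an array the way distributed does when it moves chunks
# between workers (or spills them to disk).
a = np.ones(4)
b = deserialize(*serialize(a))

# On the affected versions, the reconstructed array is a view over the
# received frame and is flagged read-only, which is what later trips up
# memoryview-based Cython kernels.
print(a.flags.writeable, b.flags.writeable)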

I can see a few ways of tackling the problem:

  • in Cython: if you aren’t writing to an array, it should just work. As this issue has been around for a long time, I suspect the fix might not be trivial.
  • in distributed, making sure that all arrays passed to all kernels are writeable
  • in dask.array, making sure that all arrays passed to all kernels are NOT writeable, which actually makes a lot of sense regardless of distributed. This would make the error crop up immediately in any naive unit test. It would also wreak havoc for many existing dask.array users, though, so it would probably need to be an opt-in setting (see the sketch after this list).
  • in the distributed docs, with a thorough tutorial on how to reproduce the problem in unit testing and how to change your kernels to fix it, so that it becomes the first result when anybody googles the exception.
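
For the dask.array option, here is a rough sketch of what such an opt-in "read-only chunks" mode could look like. The helper names are hypothetical, not an existing dask API; the point is that flipping the write flag on every chunk makes a memoryview-based kernel fail locally the same way it would on a cluster:

import numpy as np
import dask.array as da


def _as_readonly(block):
    # Return a read-only view of the chunk, mimicking what a worker can
    # hand to a kernel after the chunk has been (de)serialized.
    view = block.view()
    view.setflags(write=False)
    return view


def readonly_chunks(x):
    """Hypothetical helper: force every chunk of a dask array to be
    read-only, so naive local unit tests surface the problem."""
    return x.map_blocks(_as_readonly, dtype=x.dtype)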

On the last point, I personally solved the problem as follows:

In the kernels:

import scipy.interpolate


def _memoryview_safe(x):
    """Make array safe to run in a Cython memoryview-based kernel. These
    kernels typically break down with the error ``ValueError: buffer source
    array is read-only`` when running in dask distributed.
    """
    if not x.flags.writeable:
        if not x.flags.owndata:
            x = x.copy(order='C')
        x.setflags(write=True)
    return x


def splev(x_new, t, c, k=3, extrapolate=True):
    x_new = _memoryview_safe(x_new)
    t = _memoryview_safe(t)
    c = _memoryview_safe(c)
    spline = scipy.interpolate.BSpline.construct_fast(t, c, k, axis=0, extrapolate=extrapolate)
    return spline(x_new)

In the unit test:

import numpy as np


def test_distributed():
    def ro_array(a):
        a = np.array(a)
        a.setflags(write=False)
        # Return a view of a, so that setting the write flag on the view is not enough
        return a[:]

    t = ro_array([1, 2])
    c = ro_array([10, 20])
    x_new = ro_array([1.5, 1.8])
    splev(x_new, t, c, k=1)

If you comment out any of those calls to _memoryview_safe, the test falls over. Above I’m calling the kernel directly, but the same sanitisation can also be applied from the dask-level wrapper (probably a more robust design); a sketch follows below.
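
For the wrapper approach, a minimal sketch (the function name is hypothetical; it assumes x_new is a dask array while t and c are plain NumPy arrays, and reuses the splev and _memoryview_safe defined above) is to sanitise each chunk at the graph level before the kernel sees it:

def splev_dask(x_new, t, c, k=3, extrapolate=True):
    # Make every chunk writeable right before the kernel runs, so the
    # kernel itself no longer needs to worry about read-only buffers,
    # whatever scheduler executes the graph.
    return x_new.map_blocks(
        lambda block: splev(_memoryview_safe(block), t, c, k=k, extrapolate=extrapolate),
        dtype=x_new.dtype,
    )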

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 7
  • Comments: 28 (17 by maintainers)

Top GitHub Comments

crusaderky commented, Feb 28, 2020

Reproduced with a stack from May 2018; cannot reproduce with the latest stack as of Feb 2020. Note that downgrading Cython alone was not enough to reproduce the issue; I did not investigate exactly which package/version fixed the problem.

POC

demo.pyx

import numpy


cpdef f(double[:] x):
    return numpy.array(x)


cpdef g(const double[:] x):
    return numpy.array(x)

main.py

import dask
import dask.array as da
import dask.threaded
import distributed
import numpy
import pyximport

pyximport.install()
from demo import f, g


def main():
    a1 = da.ones(4, chunks=4)
    a2 = da.from_array(numpy.ones(4), chunks=4)
    client = distributed.Client()

    for scheduler in ('threads', 'distributed'):
        if dask.__version__ < '2':
            kwargs = {"get": client.get if scheduler == "distributed" else dask.threaded.get}
        else:
            kwargs = {"scheduler": scheduler}

        for a in (a1, a2):
            for func in (f, g):
                try:
                    b = a.map_blocks(func, dtype=a.dtype).compute(**kwargs)
                    assert b.tolist() == [1, 1, 1, 1]
                    out = "OK"
                except Exception as e:
                    out = f"{type(e).__name__}: {e}"

                print(f"{scheduler}, {func.__name__}, {a.name.split('-')[0]}: {out}")


if __name__ == "__main__":
    main()

With legacy stack

$ conda create -n legacy python=3.6 cython=0.28.1 distributed=1.21.1 dask=0.17.3 numpy=1.14.3 tornado=5.0.2 clang_osx-64
$ conda activate legacy
$ python main.py 2>/dev/null
threads, f, wrapped: OK
threads, g, wrapped: OK
threads, f, array: OK
threads, g, array: OK
distributed, f, wrapped: OK
distributed, g, wrapped: OK
distributed, f, array: ValueError: buffer source array is read-only
distributed, g, array: OK

With latest stack

$ conda create -n latest python=3.6 cython dask distributed clang_osx-64
$ conda activate latest
$ python main.py 2>/dev/null
threads, f, ones: OK
threads, g, ones: OK
threads, f, array: OK
threads, g, array: OK
distributed, f, ones: OK
distributed, g, ones: OK
distributed, f, array: OK
distributed, g, array: OK
lesteve commented, Jun 27, 2018

Support for read-only memoryviews was only recently added to Cython: cython/cython#1869. If I understand that PR correctly, this error should no longer be raised as long as the Cythonized function does not try to write to the data.

Just a quick comment on this, since I was involved in providing feedback on the Cython PR (this read-only problem happens quite often in a scikit-learn context too, or more precisely in a joblib context, which automatically memmaps inputs in read-only mode, and we were quite interested in the functionality). To benefit from the Cython feature you need to add a const to your Cython function signature, along these lines:

cpdef func_that_can_take_read_only_array(const double[:] input_array):
    ...

There is a limitation of const memoryviews at the moment: you cannot combine a const memoryview with fused types; see https://github.com/cython/cython/issues/1772 for more details. As far as scikit-learn is concerned, this is the main reason we have not moved to using const memoryviews. One possible per-dtype workaround is sketched below.
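
Until that limitation is lifted, one workaround (a sketch only, not scikit-learn’s actual code) is to spell out one signature per concrete dtype instead of using a fused type; each overload keeps the const qualifier and therefore still accepts read-only buffers:

cpdef func_float32(const float[:] input_array):
    ...


cpdef func_float64(const double[:] input_array):
    ...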
