Exception: ValueError('buffer source array is read-only')

(Comes from https://github.com/dask/distributed/issues/1978#issuecomment-645869748)

What happened:

$ ipython
Python 3.8.3 (default, May 20 2020, 12:50:54) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: # coding: utf-8 
   ...: from dask import dataframe as dd 
   ...: import pandas as pd 
   ...: from distributed import Client 
   ...: client = Client() 
   ...: df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"]) 
   ...: payment_types = { 
   ...:     1: "Credit Card", 
   ...:     2: "Cash", 
   ...:     3: "No Charge", 
   ...:     4: "Dispute", 
   ...:     5: "Unknown", 
   ...:     6: "Voided trip" 
   ...: } 
   ...: payment_names = pd.Series( 
   ...:     payment_types, name="payment_name" 
   ...: ).to_frame() 
   ...: df2 = df.merge( 
   ...:     payment_names, left_on="payment_type", right_index=True 
   ...: ) 
   ...: op = df2.groupby("payment_name")["tip_amount"].mean() 
   ...: client.compute(op) 
   ...:                                                                                                                                                                                                                                       
Out[1]: <Future: pending, key: finalize-85edcc1f23785545f628c932abd19768>

In [2]: distributed.worker - WARNING -  Compute Failed                                                                                                                                                                                        
Function:  _apply_chunk
args:      (        VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  trip_distance  RatecodeID store_and_fwd_flag  ...  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  payment_name
0              1  2019-01-04 14:08:46   2019-01-04 14:18:10                1           1.70           1                  N  ...      0.5         0.0          0.00                    0.3          9.30                   NaN          Cash
1              1  2019-01-04 14:20:33   2019-01-04 14:25:10                1           0.90           1                  N  ...      0.5         0.0          0.00                    0.3          6.30                   NaN          Cash
13             2  2019-01-04 14:14:45   2019-01-04 14:26:00                5           1.63           1                  N  ...      0.5         0.0          0.00                    0.3          9.80                   NaN          Cash
15             2  2019-01-04 14:49:45   2019-01-04 15:0
kwargs:    {'chunk': <methodcaller: sum>, 'columns': 'tip_amount'}
Exception: ValueError('buffer source array is read-only')

In [2]: client
Out[2]: <Client: 'tcp://127.0.0.1:33689' processes=4 threads=4, memory=16.70 GB>

In [3]: _1                                                                                                                                                                                                                                    
Out[3]: <Future: error, key: finalize-85edcc1f23785545f628c932abd19768>

What you expected to happen: The operation finishes without error.

Minimal Complete Verifiable Example:

# coding: utf-8
from dask import dataframe as dd
import pandas as pd
from distributed import Client
client = Client()
df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
payment_types = {
    1: "Credit Card",
    2: "Cash",
    3: "No Charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip"
}
payment_names = pd.Series(
    payment_types, name="payment_name"
).to_frame()
df2 = df.merge(
    payment_names, left_on="payment_type", right_index=True
)
op = df2.groupby("payment_name")["tip_amount"].mean()
client.compute(op)

Data:

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv

Anything else we need to know?: I managed to avoid this error by reducing the number of files, but then it hit me again at a later point. I expect this behavior to be dependent on the available RAM.
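
As a rough illustration of the failure class (not the actual Dask, pandas, or Cython code path), the sketch below uses only the Python standard library to show what a read-only buffer is and why copying it sidesteps the problem. It assumes, as the error message suggests, that the worker handed a read-only buffer to code that requested write access; the report itself does not confirm this cause. The helper name `ensure_writable` is hypothetical.

```python
def ensure_writable(buf):
    """Return a writable buffer holding the same bytes as `buf`.

    Hypothetical helper for illustration only: a writable buffer is
    passed through unchanged, a read-only one is copied into a bytearray.
    """
    view = memoryview(buf)
    if view.readonly:
        return bytearray(view)  # copy: costs memory, but allows writes
    return buf


# A bytes object exposes a read-only buffer. Here it stands in for data
# that was deserialized without a copy (an assumption, see above).
payload = bytes([1, 2, 3, 4])
view = memoryview(payload)

try:
    view[0] = 9  # in-place write into a read-only buffer
    write_failed = False
except TypeError:
    write_failed = True

writable = ensure_writable(payload)
writable[0] = 9  # succeeds on the copy; `payload` itself is untouched
```

In NumPy terms, the analogous check is `arr.flags.writeable`, and `arr.copy()` yields a writable array. Copying trades extra memory for safety, so it is a workaround rather than a fix.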

Environment:

Dask version: 2.18.1
Python version: 3.8.3
Operating System: Linux Mint 19.3
Install method (conda, pip, source): pip

Issue Analytics

Created 3 years ago
Comments: 8 (6 by maintainers)

Top GitHub Comments

jakirkham commented, Jul 14, 2020 (1 reaction)

Thanks Juan! 😀

Will look at cleaning it up and adding some more tests.

astrojuanlu commented, Jul 14, 2020 (1 reaction)

I confirm that #3918 fixed the issue with the same data:

In [1]: from dask import dataframe as dd 
   ...: from distributed import Client 
   ...:  
   ...: client = Client() 
   ...:  
   ...: df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"]) 
   ...:  
   ...: op = df.groupby("payment_type")["tip_amount"].mean() 
   ...: client.compute(op)                                                                                                                                                                                                                    
Out[1]: <Future: pending, key: finalize-d2d79ddf9a418b1c0ed76bfa1c20daf6>

In [2]: _1                                                                                                                                                                                                                                    
Out[2]: <Future: finished, type: pandas.Series, key: finalize-d2d79ddf9a418b1c0ed76bfa1c20daf6>

In [3]: _1.result()                                                                                                                                                                                                                           
Out[3]: 
payment_type
1    2.976392
2    0.000326
3    0.625695
4   -0.010395
5    0.000000
Name: tip_amount, dtype: float64
