Exception: ValueError('buffer source array is read-only')

(Comes from https://github.com/dask/distributed/issues/1978#issuecomment-645869748)

What happened:

$ ipython
Python 3.8.3 (default, May 20 2020, 12:50:54) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: # coding: utf-8 
   ...: from dask import dataframe as dd 
   ...: import pandas as pd 
   ...: from distributed import Client 
   ...: client = Client() 
   ...: df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"]) 
   ...: payment_types = { 
   ...:     1: "Credit Card", 
   ...:     2: "Cash", 
   ...:     3: "No Charge", 
   ...:     4: "Dispute", 
   ...:     5: "Unknown", 
   ...:     6: "Voided trip" 
   ...: } 
   ...: payment_names = pd.Series( 
   ...:     payment_types, name="payment_name" 
   ...: ).to_frame() 
   ...: df2 = df.merge( 
   ...:     payment_names, left_on="payment_type", right_index=True 
   ...: ) 
   ...: op = df2.groupby("payment_name")["tip_amount"].mean() 
   ...: client.compute(op) 
   ...:                                                                                                                                                                                                                                       
Out[1]: <Future: pending, key: finalize-85edcc1f23785545f628c932abd19768>

In [2]: distributed.worker - WARNING -  Compute Failed                                                                                                                                                                                        
Function:  _apply_chunk
args:      (        VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  trip_distance  RatecodeID store_and_fwd_flag  ...  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  payment_name
0              1  2019-01-04 14:08:46   2019-01-04 14:18:10                1           1.70           1                  N  ...      0.5         0.0          0.00                    0.3          9.30                   NaN          Cash
1              1  2019-01-04 14:20:33   2019-01-04 14:25:10                1           0.90           1                  N  ...      0.5         0.0          0.00                    0.3          6.30                   NaN          Cash
13             2  2019-01-04 14:14:45   2019-01-04 14:26:00                5           1.63           1                  N  ...      0.5         0.0          0.00                    0.3          9.80                   NaN          Cash
15             2  2019-01-04 14:49:45   2019-01-04 15:0
kwargs:    {'chunk': <methodcaller: sum>, 'columns': 'tip_amount'}
Exception: ValueError('buffer source array is read-only')

In [2]:                                                                                                                                                                                                                                       

In [2]: client                                                                                                                                                                                                                                
Out[2]: <Client: 'tcp://127.0.0.1:33689' processes=4 threads=4, memory=16.70 GB>

In [3]: _1                                                                                                                                                                                                                                    
Out[3]: <Future: error, key: finalize-85edcc1f23785545f628c932abd19768>

What you expected to happen: The operation finishes without error.

Minimal Complete Verifiable Example:

# coding: utf-8
from dask import dataframe as dd
import pandas as pd
from distributed import Client
client = Client()
df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
payment_types = {
    1: "Credit Card",
    2: "Cash",
    3: "No Charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip"
}
payment_names = pd.Series(
    payment_types, name="payment_name"
).to_frame()
df2 = df.merge(
    payment_names, left_on="payment_type", right_index=True
)
op = df2.groupby("payment_name")["tip_amount"].mean()
client.compute(op)
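
The merge and groupby logic above can be checked without Dask or the full dataset. The following is a minimal pandas-only sketch; the `trips` frame is a hypothetical stand-in for a few rows of the taxi CSV files, not real data, and it only verifies that the computation itself is well-formed.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the taxi CSV files (not real data).
trips = pd.DataFrame(
    {
        "payment_type": [1, 1, 2, 2, 3],
        "tip_amount": [2.0, 4.0, 0.0, 0.0, 1.0],
    }
)

# Same lookup table as in the example above.
payment_types = {
    1: "Credit Card",
    2: "Cash",
    3: "No Charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}
payment_names = pd.Series(payment_types, name="payment_name").to_frame()

# Join on the payment code, then average the tips per payment name.
merged = trips.merge(payment_names, left_on="payment_type", right_index=True)
mean_tips = merged.groupby("payment_name")["tip_amount"].mean()
print(mean_tips)
```

If this runs cleanly on small in-memory data, the failure is more likely tied to how the distributed workers handle the data than to the merge or groupby calls themselves.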

Data:

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv

Anything else we need to know?: I managed to avoid this error by reducing the number of files, but then it hit me again at a later point. I expect this behavior to be dependent on the available RAM.
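
One plausible reading of the RAM dependence, which this issue does not confirm, is that under memory pressure data gets serialized and later rebuilt from immutable byte buffers. NumPy arrays that wrap such buffers are read-only. The sketch below only demonstrates that NumPy behavior in isolation; it does not reproduce the Dask failure, and the variable names are illustrative.

```python
import numpy as np

# Serialize an array to immutable bytes, as a transfer or spill-to-disk might.
original = np.arange(5, dtype="float64")
payload = original.tobytes()

# An array that wraps an immutable bytes object is not writeable.
restored = np.frombuffer(payload, dtype="float64")
restored_is_writeable = bool(restored.flags.writeable)

# In-place writes fail on such an array.
try:
    restored[0] = 1.0
    write_error = None
except ValueError as exc:
    write_error = str(exc)

# Copying gives a fresh, writeable array (at the cost of extra memory).
writable_copy = restored.copy()
writable_copy[0] = 1.0

print(restored_is_writeable, write_error, writable_copy.flags.writeable)
```

Code that requires a writeable buffer would reject `restored` but accept `writable_copy`, which is why an explicit copy is a common, if memory-hungry, workaround for this class of error.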

Environment:

  • Dask version: 2.18.1
  • Python version: 3.8.3
  • Operating System: Linux Mint 19.3
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

jakirkham commented, Jul 14, 2020 (1 reaction)

Thanks Juan! 😀

Will look at cleaning it up and adding some more tests.

astrojuanlu commented, Jul 14, 2020 (1 reaction)

I confirm that #3918 fixed the issue with the same data:

In [1]: from dask import dataframe as dd 
   ...: from distributed import Client 
   ...:  
   ...: client = Client() 
   ...:  
   ...: df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"]) 
   ...:  
   ...: op = df.groupby("payment_type")["tip_amount"].mean() 
   ...: client.compute(op)                                                                                                                                                                                                                    
Out[1]: <Future: pending, key: finalize-d2d79ddf9a418b1c0ed76bfa1c20daf6>

In [2]: _1                                                                                                                                                                                                                                    
Out[2]: <Future: finished, type: pandas.Series, key: finalize-d2d79ddf9a418b1c0ed76bfa1c20daf6>

In [3]: _1.result()                                                                                                                                                                                                                           
Out[3]: 
payment_type
1    2.976392
2    0.000326
3    0.625695
4   -0.010395
5    0.000000
Name: tip_amount, dtype: float64

Top Results From Across the Web

Tackle "ValueError: buffer source array is read-only" #1978
array with a distributed scheduler is the dreaded ValueError: buffer source array is read-only. This error is typical when one runs a ...

Pandas ValueError: buffer source array is read-only
This is a bug in the latest release of pandas (0.23.x) and will be solved in pandas 0.24+. This issue was reported already...

Typed Memoryviews — Cython 3.0.0a11 documentation
Typed memoryviews allow efficient access to memory buffers, such as those underlying NumPy arrays, without incurring any Python overhead. Memoryviews are ...

buffer source array is read-only with ds.map_batches and ...
I am facing problems processing the text data using ds.map_batches with pandas as the batch format. Getting ValueError: buffer source array is ...

Apache Arrow in PySpark — PySpark 3.2.0 documentation
Typically, you would see the error ValueError: buffer source array is read-only. Newer versions of Pandas may fix these errors by improving...
