Exception: ValueError('buffer source array is read-only')
See original GitHub issue (from https://github.com/dask/distributed/issues/1978#issuecomment-645869748)
What happened:
$ ipython
Python 3.8.3 (default, May 20 2020, 12:50:54)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.15.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: # coding: utf-8
...: from dask import dataframe as dd
...: import pandas as pd
...: from distributed import Client
...: client = Client()
...: df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
...: payment_types = {
...:     1: "Credit Card",
...:     2: "Cash",
...:     3: "No Charge",
...:     4: "Dispute",
...:     5: "Unknown",
...:     6: "Voided trip"
...: }
...: payment_names = pd.Series(
...:     payment_types, name="payment_name"
...: ).to_frame()
...: df2 = df.merge(
...:     payment_names, left_on="payment_type", right_index=True
...: )
...: op = df2.groupby("payment_name")["tip_amount"].mean()
...: client.compute(op)
...:
Out[1]: <Future: pending, key: finalize-85edcc1f23785545f628c932abd19768>
In [2]: distributed.worker - WARNING - Compute Failed
Function: _apply_chunk
args: ( VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag ... mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge payment_name
0 1 2019-01-04 14:08:46 2019-01-04 14:18:10 1 1.70 1 N ... 0.5 0.0 0.00 0.3 9.30 NaN Cash
1 1 2019-01-04 14:20:33 2019-01-04 14:25:10 1 0.90 1 N ... 0.5 0.0 0.00 0.3 6.30 NaN Cash
13 2 2019-01-04 14:14:45 2019-01-04 14:26:00 5 1.63 1 N ... 0.5 0.0 0.00 0.3 9.80 NaN Cash
15 2 2019-01-04 14:49:45 2019-01-04 15:0
kwargs: {'chunk': <methodcaller: sum>, 'columns': 'tip_amount'}
Exception: ValueError('buffer source array is read-only')
In [2]:
In [2]: client
Out[2]: <Client: 'tcp://127.0.0.1:33689' processes=4 threads=4, memory=16.70 GB>
In [3]: _1
Out[3]: <Future: error, key: finalize-85edcc1f23785545f628c932abd19768>
What you expected to happen: The operation finishes without error.
Minimal Complete Verifiable Example:
# coding: utf-8
from dask import dataframe as dd
import pandas as pd
from distributed import Client
client = Client()
df = dd.read_csv("../data/yellow_tripdata_2019-*.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
payment_types = {
    1: "Credit Card",
    2: "Cash",
    3: "No Charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip"
}
payment_names = pd.Series(
    payment_types, name="payment_name"
).to_frame()
df2 = df.merge(
    payment_names, left_on="payment_type", right_index=True
)
op = df2.groupby("payment_name")["tip_amount"].mean()
client.compute(op)
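For context, a likely trigger (my own assumption, not confirmed in this issue) is that distributed reconstructs NumPy-backed data zero-copy from immutable network frames, so the resulting arrays are read-only, and Cython code in pandas that requests a writable memoryview then raises exactly this ValueError. A minimal sketch of that effect, using only NumPy:

```python
import numpy as np

# Simulate a column arriving as an immutable network frame:
frame = np.arange(4, dtype="float64").tobytes()  # bytes object, immutable

# Zero-copy reconstruction yields a read-only array, because the
# backing buffer cannot be written to.
arr = np.frombuffer(frame, dtype="float64")
print(arr.flags.writeable)  # False

# Code that needs to write into such an array fails the same way
# a Cython writable-memoryview request does:
try:
    arr[0] = 1.0
except ValueError as exc:
    print(exc)  # assignment destination is read-only
```

This only illustrates the read-only-buffer mechanism; whether dask's serialization path is the actual source of the read-only buffer here is an assumption.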
Data:
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv
Anything else we need to know?: I managed to avoid this error by reducing the number of files, but then it hit me again at a later point. I expect this behavior to be dependent on the available RAM.
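A possible workaround (my own sketch, not suggested anywhere in this issue) is to force a writable copy of the data before the failing operation, since a copy owns a fresh buffer. In plain NumPy terms:

```python
import numpy as np

# A read-only array, like one rebuilt zero-copy from an immutable frame:
values = np.frombuffer(np.arange(6, dtype="float64").tobytes(), dtype="float64")
assert not values.flags.writeable

# .copy() allocates a new buffer owned by the new array, so it is
# writable and safe to pass to code that demands a writable buffer.
writable = values.copy()
assert writable.flags.writeable
writable[0] = 99.0  # now allowed
```

Applied to the reproducer, this would be something like `df2 = df2.map_partitions(lambda pdf: pdf.copy())` before the groupby; `map_partitions` and `DataFrame.copy` are real dask/pandas APIs, but I have not verified that this actually avoids the error here.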
Environment:
- Dask version: 2.18.1
- Python version: 3.8.3
- Operating System: Linux Mint 19.3
- Install method (conda, pip, source): pip
Issue Analytics
- Created: 3 years ago
- Comments: 8 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks Juan! 😀
Will look at cleaning it up and adding some more tests.
I confirm that #3918 fixed the issue with the same data: