`dask.dataframe.read_csv('./filepath/*.csv')` returning tuple
What happened:
Loading a dataframe seemingly returned a tuple rather than a dask.dataframe, as an exception was thrown:
AttributeError: 'tuple' object has no attribute 'sample'
What you expected to happen:
I expected the code below to return a pandas.DataFrame with the correlations that I'm looking for!
Minimal Complete Verifiable Example:
import os

import dask.dataframe as daskdf
from dask.distributed import Client

client = Client(memory_limit='4GB', processes=False)
raw_df = daskdf.read_csv(os.path.join(input_file_path, '*.csv'))
df = raw_df.sample(frac=0.01).drop(columns=['gaugeid', 'time', 'input', 'labels'])
correlations = df.corr().compute()
Anything else we need to know?: The example runs fine on my local machine (Windows 10, Dask 2021.1.1, Python 3.8.5); it fails only when run in containerised compute provided by Azure.
The full traceback is here:
Traceback (most recent call last):
File "correlation_analysis.py", line 43, in <module>
correlations = df.corr().compute()
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/base.py", line 285, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/base.py", line 567, in compute
results = schedule(dsk, keys, **kwargs)
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 2673, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 1982, in gather
return self.sync(
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 853, in sync
return sync(
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/utils.py", line 354, in sync
raise exc.with_traceback(tb)
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/utils.py", line 337, in f
result[0] = yield future
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 1847, in _gather
raise exception.with_traceback(traceback)
File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/dataframe/methods.py", line 352, in sample
return df.sample(random_state=rs, frac=frac, replace=replace) if len(df) > 0 else df
AttributeError: 'tuple' object has no attribute 'sample'
Environment:
- Dask version: 2021.6.0
- Python version: 3.8.1
- Operating System: Linux
- Install method (conda, pip, source): conda
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7 (4 by maintainers)

I ran into a similar issue with dask.dataframe.read_csv().compute() returning a tuple instead of a pandas dataframe. Setting dask.config.set({"optimization.fuse.active": True}) in the code, or passing processes=True when starting the Client, both solve the problem. Dask version: 2021.12.0, pandas version: 1.3.5, Python version: 3.7

I can confirm this bug exists, and I solved it by just removing the processes=False option. So I think the observation above is probably correct. Dask version: 2021.8.0
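Putting the workarounds from the comments together, a minimal setup sketch might look like the following. This is not a confirmed fix from the Dask maintainers, just the two mitigations reported in this thread; the memory limit value is carried over from the original example.

```python
import dask
from dask.distributed import Client

# Workaround 1 (reported above): explicitly re-enable graph fusion,
# which newer Dask releases may leave disabled under the distributed
# scheduler.
dask.config.set({"optimization.fuse.active": True})

# Workaround 2 (reported above): use separate worker processes instead
# of an in-process cluster, i.e. drop processes=False / set it to True.
client = Client(memory_limit='4GB', processes=True)
```

Either change on its own reportedly makes read_csv(...).compute() return a proper dataframe again instead of a tuple.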