
`dask.dataframe.read_csv('./filepath/*.csv')` returning tuple

See original GitHub issue

What happened: Loading a dataframe seemingly returned a tuple rather than a dask.dataframe, as downstream code raised AttributeError: 'tuple' object has no attribute 'sample'

What you expected to happen: I expected the code below to return a pandas.DataFrame with the correlations that I’m looking for!

Minimal Complete Verifiable Example:

import os

import dask.dataframe as daskdf
from dask.distributed import Client

# In-process client (no worker processes), 4 GB memory limit
client = Client(memory_limit='4GB', processes=False)

raw_df = daskdf.read_csv(os.path.join(input_file_path, '*.csv'))
df = raw_df.sample(frac=0.01).drop(['gaugeid', 'time', 'input', 'labels'], axis=1)
correlations = df.corr().compute()

Anything else we need to know?: The example runs fine on my local machine (Windows 10, Dask 2021.1.1, Python 3.8.5); it only fails when run in containerised compute provided by Azure.

The full traceback is here:

Traceback (most recent call last):
  File "correlation_analysis.py", line 43, in <module>
    correlations = df.corr().compute()
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/base.py", line 285, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 2673, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 1982, in gather
    return self.sync(
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 853, in sync
    return sync(
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/utils.py", line 354, in sync
    raise exc.with_traceback(tb)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/utils.py", line 337, in f
    result[0] = yield future
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 1847, in _gather
    raise exception.with_traceback(traceback)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/dataframe/methods.py", line 352, in sample
    return df.sample(random_state=rs, frac=frac, replace=replace) if len(df) > 0 else df
AttributeError: 'tuple' object has no attribute 'sample'

Environment:

  • Dask version: 2021.6.0
  • Python version: 3.8.1
  • Operating System: Linux
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

snowoody commented on Dec 15, 2021 (3 reactions)

I ran into a similar issue with dask.dataframe.read_csv().compute() returning a tuple instead of a pandas dataframe.

Either setting `dask.config.set({"optimization.fuse.active": True})` in the code or passing `processes=True` when starting the Client resolves the problem. (Dask 2021.12.0, pandas 1.3.5, Python 3.7)

umonaca commented on Aug 30, 2021 (1 reaction)

I can confirm this bug exists; I solved it by simply removing the processes=False option, so the observation above is probably correct. Dask version: 2021.8.0


