
`dask.dataframe.read_csv('./filepath/*.csv')` returning tuple

See original GitHub issue

What happened: Loading a dataframe seemingly returned a tuple rather than a dask.dataframe, as downstream code raised AttributeError: 'tuple' object has no attribute 'sample'

What you expected to happen: I expected the code below to return a pandas.DataFrame with the correlations that I’m looking for!

Minimal Complete Verifiable Example:

import os

import dask.dataframe as daskdf
from dask.distributed import Client

# In-process client (no worker processes), 4 GB memory limit
client = Client(memory_limit='4GB', processes=False)

raw_df = daskdf.read_csv(os.path.join(input_file_path, '*.csv'))
df = raw_df.sample(frac=0.01).drop(['gaugeid', 'time', 'input', 'labels'], axis=1)
correlations = df.corr().compute()

Anything else we need to know?: The example runs fine on my local machine (Windows 10, Dask 2021.1.1, Python 3.8.5); it only fails when run in containerised compute provided by Azure.

The full traceback is here:

Traceback (most recent call last):
  File "correlation_analysis.py", line 43, in <module>
    correlations = df.corr().compute()
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/base.py", line 285, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 2673, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 1982, in gather
    return self.sync(
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 853, in sync
    return sync(
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/utils.py", line 354, in sync
    raise exc.with_traceback(tb)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/utils.py", line 337, in f
    result[0] = yield future
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/distributed/client.py", line 1847, in _gather
    raise exception.with_traceback(traceback)
  File "/azureml-envs/azureml_datastore/lib/python3.8/site-packages/dask/dataframe/methods.py", line 352, in sample
    return df.sample(random_state=rs, frac=frac, replace=replace) if len(df) > 0 else df
AttributeError: 'tuple' object has no attribute 'sample'

Environment:

  • Dask version: 2021.6.0
  • Python version: 3.8.1
  • Operating System: Linux
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

snowoody commented on Dec 15, 2021 (3 reactions)

I ran into a similar issue with dask.dataframe.read_csv().compute() returning a tuple instead of a pandas dataframe.

Either setting `dask.config.set({"optimization.fuse.active": True})` in the code or passing `processes=True` when starting the Client resolves the problem. (Dask 2021.12.0, pandas 1.3.5, Python 3.7)

umonaca commented on Aug 30, 2021 (1 reaction)

I can confirm this bug exists; I solved it by simply removing the processes=False option, so the observation above is probably correct. Dask version: 2021.8.0


