Error with read_csv using dask.dataframe

I get an error when calling dd.read_csv with Dask 0.9.0. The same code works with Dask 0.8.2 and pandas 0.18.1. Steps to reproduce are below:

Download a sample CSV file from the GDELT dataset:

$ wget https://s3.amazonaws.com/blaze-data/gdelt/csv/20140101.export.csv

Read the file with dask.dataframe and call head():

>>> import dask.dataframe as dd
>>> df = dd.read_csv('20140101.export.csv', sep='\t', header=None)
>>> df.head()

ValueError                                Traceback (most recent call last)
<ipython-input-3-2569c44faf66> in <module>()
----> 1 df.head()

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py in head(self, n, compute)
    369 
    370         if compute:
--> 371             result = result.compute()
    372         return result
    373 

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
     35 
     36     def compute(self, **kwargs):
---> 37         return compute(self, **kwargs)[0]
     38 
     39     @classmethod

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
    108                 for opt, val in groups.items()])
    109     keys = [var._keys() for var in variables]
--> 110     results = get(dsk, keys, **kwargs)
    111 
    112     results_iter = iter(results)

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     55     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     56                         cache=cache, queue=queue, get_id=_thread_get_id,
---> 57                         **kwargs)
     58 
     59     return results

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    486                 _execute_task(task, data)  # Re-execute locally
    487             else:
--> 488                 raise(remote_exception(res, tb))
    489         state['cache'][key] = res
    490         finish_task(dsk, key, state, results, keyorder.get)

ValueError: could not convert string to float: 'HRW'

Traceback
---------
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/dataframe/csv.py", line 43, in bytes_read_csv
    coerce_dtypes(df, dtypes)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/dataframe/csv.py", line 67, in coerce_dtypes
    df[c] = df[c].astype(dtypes[c])
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2950, in astype
    raise_on_error=raise_on_error, **kwargs)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2938, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2890, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 434, in astype
    values=values, **kwargs)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 477, in _astype
    values = com._astype_nansafe(values.ravel(), dtype, copy=True)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/common.py", line 1920, in _astype_nansafe
    return arr.astype(dtype)
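
The frames in coerce_dtypes and astype suggest that a column had been assigned a float dtype and that a later value, 'HRW', could not be cast to it. Below is a minimal, standard-library-only sketch of that failure mode. It assumes (the report does not say so) that the dtype was guessed from a sample of leading rows; infer_kind and coerce are hypothetical helpers written for illustration, not Dask functions, and the column values are made up.

```python
def infer_kind(sample):
    """Guess a column type from a sample of string values."""
    try:
        for value in sample:
            if value != "":
                float(value)
    except ValueError:
        return str
    return float


def coerce(values, kind):
    """Cast every value to the guessed type; empty strings become NaN for floats."""
    if kind is float:
        return [float(v) if v != "" else float("nan") for v in values]
    return list(values)


# Hypothetical column: the leading rows are empty or numeric, a later row holds text.
column = ["", "", "1.5", "HRW"]
kind = infer_kind(column[:3])  # the sample never sees the text value

try:
    coerce(column, kind)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the sample had included the 'HRW' row, infer_kind would have returned str and the cast would have been a no-op, which is why a larger sample or an explicit dtype tends to avoid this class of error.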

Issue Analytics

State:
Created 7 years ago
Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction

jcrist commented, May 2, 2017

I believe this has been resolved (through both code and doc improvements). Closing.

0 reactions

Umersaeed81 commented, May 19, 2021

import dask.dataframe as dd

cell = dd.read_csv('*.csv',
                   skiprows=[0, 1, 2, 3, 4, 5],
                   skipfooter=1, engine='python',
                   na_values=['NIL', '/0'], parse_dates=["Date"], assume_missing=True)
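
An alternative to relying on inference is to pass explicit dtypes for the columns that hold text. The sketch below is one way to find those columns, using only the standard library: it scans the whole file and reports every column that contains at least one non-empty value that cannot be parsed as a float. non_numeric_columns and object_dtype_overrides are hypothetical helpers, not part of Dask or pandas, and the default csv quoting rules are assumed to suit the file.

```python
import csv


def non_numeric_columns(path, delimiter="\t"):
    """Return indices of columns holding at least one non-empty, non-numeric value."""
    found = set()
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            for index, value in enumerate(row):
                # Skip columns already flagged and empty (missing) cells.
                if index in found or value == "":
                    continue
                try:
                    float(value)
                except ValueError:
                    found.add(index)
    return sorted(found)


def object_dtype_overrides(path, delimiter="\t"):
    """Build a {column_index: 'object'} mapping for the text columns."""
    return {index: "object" for index in non_numeric_columns(path, delimiter)}
```

The resulting mapping could then be passed as the dtype argument of dd.read_csv together with header=None, on the assumption that the integer keys match the default integer column labels used when no header row is read.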
