Error with read_csv using dask.dataframe

I get an error when calling dd.read_csv with Dask 0.9.0. The same code works with Dask 0.8.2 and pandas 0.18.1. Steps to reproduce are below:

Download a sample CSV file from the GDELT dataset:

$ wget https://s3.amazonaws.com/blaze-data/gdelt/csv/20140101.export.csv

Perform an operation with Dask.Dataframe:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('20140101.export.csv', sep='\t', header=None)
>>> df.head()

ValueError                                Traceback (most recent call last)
<ipython-input-3-2569c44faf66> in <module>()
----> 1 df.head()

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py in head(self, n, compute)
    369 
    370         if compute:
--> 371             result = result.compute()
    372         return result
    373 

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
     35 
     36     def compute(self, **kwargs):
---> 37         return compute(self, **kwargs)[0]
     38 
     39     @classmethod

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
    108                 for opt, val in groups.items()])
    109     keys = [var._keys() for var in variables]
--> 110     results = get(dsk, keys, **kwargs)
    111 
    112     results_iter = iter(results)

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     55     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     56                         cache=cache, queue=queue, get_id=_thread_get_id,
---> 57                         **kwargs)
     58 
     59     return results

/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    486                 _execute_task(task, data)  # Re-execute locally
    487             else:
--> 488                 raise(remote_exception(res, tb))
    489         state['cache'][key] = res
    490         finish_task(dsk, key, state, results, keyorder.get)

ValueError: could not convert string to float: 'HRW'

Traceback
---------
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/dataframe/csv.py", line 43, in bytes_read_csv
    coerce_dtypes(df, dtypes)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/dataframe/csv.py", line 67, in coerce_dtypes
    df[c] = df[c].astype(dtypes[c])
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2950, in astype
    raise_on_error=raise_on_error, **kwargs)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2938, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2890, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 434, in astype
    values=values, **kwargs)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 477, in _astype
    values = com._astype_nansafe(values.ravel(), dtype, copy=True)
  File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/common.py", line 1920, in _astype_nansafe
    return arr.astype(dtype)
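
The last frames of the traceback show coerce_dtypes in dask/dataframe/csv.py casting a column with astype and failing on the value 'HRW'. A likely reading, which is an assumption rather than something the traceback states, is that the column's dtype was chosen from an early sample of the file that happened to look numeric. The sketch below imitates that mechanism with the standard library only. The helpers infer_column_types and coerce are hypothetical and are not Dask functions, and the three-row data is made up for illustration.

```python
import csv
import io


def infer_column_types(rows):
    """Guess float or str for each column, looking only at the rows given."""
    types = []
    for column in zip(*rows):
        try:
            for value in column:
                float(value)
            types.append(float)
        except ValueError:
            types.append(str)
    return types


def coerce(rows, types):
    """Cast every value to the type chosen for its column."""
    return [[t(v) for t, v in zip(types, row)] for row in rows]


# Made-up tab-separated data: the second column turns into text on row 3.
data = "1\t10\n2\t20\n3\tHRW\n"
rows = list(csv.reader(io.StringIO(data), delimiter="\t"))

# Infer types from the first two rows only, then apply them to the rest.
sample, rest = rows[:2], rows[2:]
types = infer_column_types(sample)

try:
    coerce(rest, types)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# Stating the types up front avoids relying on the sample.
explicit = coerce(rows, [float, str])
print(explicit)
```

In this sketch the failure goes away once the second column is declared as text instead of being inferred from the sample.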

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

jcrist commented, May 2, 2017 (1 reaction)

I believe this has been resolved (through both code and doc improvements). Closing.

Umersaeed81 commented, May 19, 2021 (0 reactions)

cell = dd.read_csv('*.csv',
                   skiprows=[0, 1, 2, 3, 4, 5],
                   skipfooter=1, engine='python',
                   na_values=['NIL', '/0'], parse_dates=["Date"], assume_missing=True)

Top Results From Across the Web

Dealing with Parse Errors when reading in csv via dask ...
However, I am running into an issue of what looks like column bleeding in the last column. See the code and error below....

dask.dataframe.read_csv - Dask documentation
Read CSV files into a Dask.DataFrame. This parallelizes the pandas.read_csv() function in the following ways: It supports loading many files at once using...

Reading CSV files into Dask DataFrames with read_csv
This blog post explains how to read one or multiple CSV files into a Dask DataFrame with read_csv.

Errors reading CSV file into Dask dataframe #1921 - GitHub
If you're on a recent-ish version, the error message you're getting indicates a dtype mismatch between the first sample bytes (using the name...

Dask gives KeyError with read_csv - Dask DataFrame
This annoying error means that Pandas can not find your column name in your dataframe. Before doing anything with the data frame, use...
