Error with read_csv using dask.dataframe
I get an error when using dd.read_csv with Dask 0.9.0. This works with Dask 0.8.2 and pandas 0.18.1. Steps to reproduce are below:
Download a sample CSV file from the GDELT dataset:
$ wget https://s3.amazonaws.com/blaze-data/gdelt/csv/20140101.export.csv
Read the file with dask.dataframe and call head():
>>> import dask.dataframe as dd
>>> df = dd.read_csv('20140101.export.csv', sep='\t', header=None)
>>> df.head()
ValueError Traceback (most recent call last)
<ipython-input-3-2569c44faf66> in <module>()
----> 1 df.head()
/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py in head(self, n, compute)
369
370 if compute:
--> 371 result = result.compute()
372 return result
373
/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
35
36 def compute(self, **kwargs):
---> 37 return compute(self, **kwargs)[0]
38
39 @classmethod
/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
108 for opt, val in groups.items()])
109 keys = [var._keys() for var in variables]
--> 110 results = get(dsk, keys, **kwargs)
111
112 results_iter = iter(results)
/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
55 results = get_async(pool.apply_async, len(pool._pool), dsk, result,
56 cache=cache, queue=queue, get_id=_thread_get_id,
---> 57 **kwargs)
58
59 return results
/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
486 _execute_task(task, data) # Re-execute locally
487 else:
--> 488 raise(remote_exception(res, tb))
489 state['cache'][key] = res
490 finish_task(dsk, key, state, results, keyorder.get)
ValueError: could not convert string to float: 'HRW'
Traceback
---------
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
result = _execute_task(task, data)
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/dataframe/csv.py", line 43, in bytes_read_csv
coerce_dtypes(df, dtypes)
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/dask/dataframe/csv.py", line 67, in coerce_dtypes
df[c] = df[c].astype(dtypes[c])
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2950, in astype
raise_on_error=raise_on_error, **kwargs)
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2938, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 2890, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 434, in astype
values=values, **kwargs)
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 477, in _astype
values = com._astype_nansafe(values.ravel(), dtype, copy=True)
File "/Users/koverholt/anaconda3/lib/python3.5/site-packages/pandas/core/common.py", line 1920, in _astype_nansafe
return arr.astype(dtype)
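The traceback ends in coerce_dtypes, where a column is cast with astype and the cast fails on the string 'HRW'. One plausible reading (an assumption, not something confirmed in this report) is that the column's dtype was guessed as float from rows near the start of the file, while a later row holds text. The sketch below reproduces that failure mode with the standard library only; the sample data and the helper names infer_types and coerce are hypothetical and are not Dask's implementation.

```python
import csv
import io

# Hypothetical tab-separated data: the second column looks numeric in the
# first two rows, but the third row holds the string 'HRW'.
RAW = "1\t10.5\n2\t20.0\n3\tHRW\n"


def infer_types(rows):
    """Guess float or str for each column from the given sample rows."""
    types = []
    for column in zip(*rows):
        try:
            for value in column:
                float(value)
            types.append(float)
        except ValueError:
            types.append(str)
    return types


def coerce(rows, types):
    """Cast every value to the type chosen for its column."""
    return [[t(v) for t, v in zip(types, row)] for row in rows]


rows = list(csv.reader(io.StringIO(RAW), delimiter="\t"))
sample, rest = rows[:2], rows[2:]

# Inference sees only the sample, so both columns are guessed as float.
inferred = infer_types(sample)

# Applying the guessed types to the remaining rows fails on 'HRW'.
try:
    coerce(rest, inferred)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Declaring the second column as str up front avoids the failed cast.
explicit = [float, str]
coerced = coerce(rows, explicit)
```

If this reading is right, the equivalent workaround in the reproduction above would be to pass an explicit dtype for the affected columns to dd.read_csv (for example, a dtype mapping that marks them as str), so that no float cast is attempted on them. Which columns of the GDELT file are affected is not stated in this report.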
Issue Analytics
- State:
- Created 7 years ago
- Comments:7 (6 by maintainers)
Top GitHub Comments
I believe this has been resolved (through both code and doc improvements). Closing.
cell = dd.read_csv('*.csv',
                   skiprows=[0, 1, 2, 3, 4, 5],
                   skipfooter=1, engine='python',
                   na_values=['NIL', '/0'], parse_dates=["Date"], assume_missing=True)