Unhelpful error message in `read_csv` when an object column is mistakenly inferred as numeric.
I’m not sure how to begin debugging this, and I can’t share the data directly. Perhaps someone could point me in the right direction.
I’m running Dask 0.15.0 in Python 2.7.6.
When I import this CSV in Pandas, it works correctly. If I do the same import with Dask and try to do anything with the result (e.g. .head(10), .unique(), etc.), I immediately get an error. It doesn’t matter whether read_csv infers the types or I set all columns to strings; I still get the same error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-aa54a2f8d1de> in <module>()
----> 1 unique_stuff = my_df['Value'].unique().compute()
/usr/local/lib/python2.7/dist-packages/dask/base.pyc in compute(self, **kwargs)
95 Extra keywords to forward to the scheduler ``get`` function.
96 """
---> 97 (result,) = compute(self, traverse=False, **kwargs)
98 return result
99
/usr/local/lib/python2.7/dist-packages/dask/base.pyc in compute(*args, **kwargs)
202 dsk = collections_to_dsk(variables, optimize_graph, **kwargs)
203 keys = [var._keys() for var in variables]
--> 204 results = get(dsk, keys, **kwargs)
205
206 results_iter = iter(results)
/usr/local/lib/python2.7/dist-packages/dask/threaded.pyc in get(dsk, result, cache, num_workers, **kwargs)
73 results = get_async(pool.apply_async, len(pool._pool), dsk, result,
74 cache=cache, get_id=_thread_get_id,
---> 75 pack_exception=pack_exception, **kwargs)
76
77 # Cleanup pools associated to dead threads
/usr/local/lib/python2.7/dist-packages/dask/local.pyc in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
519 _execute_task(task, data) # Re-execute locally
520 else:
--> 521 raise_exception(exc, tb)
522 res, worker_id = loads(res_info)
523 state['cache'][key] = res
/usr/local/lib/python2.7/dist-packages/dask/local.pyc in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
288 try:
289 task, data = loads(task_info)
--> 290 result = _execute_task(task, data)
291 id = get_id()
292 result = dumps((result, id))
/usr/local/lib/python2.7/dist-packages/dask/local.pyc in _execute_task(arg, cache, dsk)
269 func, args = arg[0], arg[1:]
270 args2 = [_execute_task(a, cache) for a in args]
--> 271 return func(*args2)
272 elif not ishashable(arg):
273 return arg
/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.pyc in pandas_read_text(reader, b, header, kwargs, dtypes, columns, write_header, enforce)
59 df = reader(bio, **kwargs)
60 if dtypes:
---> 61 coerce_dtypes(df, dtypes)
62
63 if enforce and columns and (list(df.columns) != list(columns)):
/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.pyc in coerce_dtypes(df, dtypes)
101 raise ValueError(msg % (missing_list, missing_dict))
102
--> 103 df[c] = df[c].astype(dtypes[c])
104
105
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in astype(self, dtype, copy, raise_on_error, **kwargs)
3052 # else, only a single dtype is given
3053 new_data = self._data.astype(dtype=dtype, copy=copy,
-> 3054 raise_on_error=raise_on_error, **kwargs)
3055 return self._constructor(new_data).__finalize__(self)
3056
/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, **kwargs)
3187
3188 def astype(self, dtype, **kwargs):
-> 3189 return self.apply('astype', dtype=dtype, **kwargs)
3190
3191 def convert(self, **kwargs):
/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
3054
3055 kwargs['mgr'] = self
-> 3056 applied = getattr(b, f)(**kwargs)
3057 result_blocks = _extend_blocks(applied, result_blocks)
3058
/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, copy, raise_on_error, values, **kwargs)
459 **kwargs):
460 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 461 values=values, **kwargs)
462
463 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs)
502
503 # _astype_nansafe works fine with 1-d only
--> 504 values = _astype_nansafe(values.ravel(), dtype, copy=True)
505 values = values.reshape(self.shape)
506
/usr/local/lib/python2.7/dist-packages/pandas/types/cast.pyc in _astype_nansafe(arr, dtype, copy)
535
536 if copy:
--> 537 return arr.astype(dtype)
538 return arr.view(dtype)
539
ValueError: could not convert string to float: adsads 016333000054
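For reference, the call pattern behind this traceback is roughly the following (a sketch; the file path is a placeholder, `Value` is the column named in the traceback):

```python
import dask.dataframe as dd

my_df = dd.read_csv('my_data.csv')                 # dtypes inferred from a sample
unique_stuff = my_df['Value'].unique().compute()   # ValueError raised here
```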
Any suggestions?
Issue Analytics
- Created 6 years ago
- Comments: 9 (6 by maintainers)
Top GitHub Comments
The issue here is that dask coordinates many pandas dataframes. To ease correctness, we enforce that all partitions have the same schema at all points in time. Thus, our `read_csv` implementation does the following:

1. Infers the dtypes from a sample of the data (or takes them from the `dtype` kwarg if given).
2. Calls `pd.read_csv` on each partition.
3. Enforces that the dtypes of that partition match those inferred/given in step 1. If they don’t, we error (as you noticed).

Step 3 only happens at compute time (after `.compute()` is called), while the rest happens at graph build time. Subsequent operations (like `unique` or `astype`) add steps in the graph after the `read_csv` calls, so our dtype enforcement never sees your subsequent calls to `astype`. Does that make sense?
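A minimal sketch of that ordering, under the assumption that a file data.csv has a 'Value' column which is mostly numeric but contains occasional strings, and that the sampled rows happen to be all numeric:

```python
import dask.dataframe as dd

# Dtypes are inferred from a sample at graph build time; with an all-numeric
# sample, 'Value' is guessed to be float64.
df = dd.read_csv('data.csv')

# This astype is just another task added *after* the read_csv tasks in the
# graph; the per-partition dtype enforcement inside read_csv never sees it.
result = df['Value'].astype(str).unique()

# Only now do the read_csv tasks run, hit a string in some partition, and
# raise "could not convert string to float" while coercing to the inferred
# float64 dtype.
result.compute()
```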
Note that for performance reasons you only want things that need to be object dtype to be object dtype. Numeric columns should still be float/int (apologies if you already knew this).
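For instance (a sketch; the file name is a placeholder), passing the dtype to `read_csv` itself, and only for the column that actually needs it, keeps the rest of the frame numeric instead of stringifying every column:

```python
import dask.dataframe as dd

# 'Value' is read as object (strings) on every partition, so the per-partition
# dtype check has nothing to coerce; all other columns are still inferred and
# keep their fast numeric dtypes.
df = dd.read_csv('data.csv', dtype={'Value': 'object'})
unique_values = df['Value'].unique().compute()
```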
This was fixed in #2522. Closing.