
Unhelpful error message in `read_csv` when an object column is mistakenly inferred as numeric

See original GitHub issue

I’m not sure how to begin debugging this, and I can’t share the data directly. Could someone point me in the right direction?

I’m running Dask 0.15.0 in Python 2.7.6.

When I import this CSV in Pandas, it works correctly. If I do the same import with Dask and then try to do anything with the result (e.g. .head(10), .unique(), etc.), I immediately get an error. Whether read_csv infers the types or I set all columns to strings, I get the same error:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-aa54a2f8d1de> in <module>()
----> 1 unique_stuff = my_df['Value'].unique().compute()

/usr/local/lib/python2.7/dist-packages/dask/base.pyc in compute(self, **kwargs)
     95             Extra keywords to forward to the scheduler ``get`` function.
     96         """
---> 97         (result,) = compute(self, traverse=False, **kwargs)
     98         return result
     99 

/usr/local/lib/python2.7/dist-packages/dask/base.pyc in compute(*args, **kwargs)
    202     dsk = collections_to_dsk(variables, optimize_graph, **kwargs)
    203     keys = [var._keys() for var in variables]
--> 204     results = get(dsk, keys, **kwargs)
    205 
    206     results_iter = iter(results)

/usr/local/lib/python2.7/dist-packages/dask/threaded.pyc in get(dsk, result, cache, num_workers, **kwargs)
     73     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     74                         cache=cache, get_id=_thread_get_id,
---> 75                         pack_exception=pack_exception, **kwargs)
     76 
     77     # Cleanup pools associated to dead threads

/usr/local/lib/python2.7/dist-packages/dask/local.pyc in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    519                         _execute_task(task, data)  # Re-execute locally
    520                     else:
--> 521                         raise_exception(exc, tb)
    522                 res, worker_id = loads(res_info)
    523                 state['cache'][key] = res

/usr/local/lib/python2.7/dist-packages/dask/local.pyc in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    288     try:
    289         task, data = loads(task_info)
--> 290         result = _execute_task(task, data)
    291         id = get_id()
    292         result = dumps((result, id))

/usr/local/lib/python2.7/dist-packages/dask/local.pyc in _execute_task(arg, cache, dsk)
    269         func, args = arg[0], arg[1:]
    270         args2 = [_execute_task(a, cache) for a in args]
--> 271         return func(*args2)
    272     elif not ishashable(arg):
    273         return arg

/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.pyc in pandas_read_text(reader, b, header, kwargs, dtypes, columns, write_header, enforce)
     59     df = reader(bio, **kwargs)
     60     if dtypes:
---> 61         coerce_dtypes(df, dtypes)
     62 
     63     if enforce and columns and (list(df.columns) != list(columns)):

/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.pyc in coerce_dtypes(df, dtypes)
    101                 raise ValueError(msg % (missing_list, missing_dict))
    102 
--> 103             df[c] = df[c].astype(dtypes[c])
    104 
    105 

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in astype(self, dtype, copy, raise_on_error, **kwargs)
   3052         # else, only a single dtype is given
   3053         new_data = self._data.astype(dtype=dtype, copy=copy,
-> 3054                                      raise_on_error=raise_on_error, **kwargs)
   3055         return self._constructor(new_data).__finalize__(self)
   3056 

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, **kwargs)
   3187 
   3188     def astype(self, dtype, **kwargs):
-> 3189         return self.apply('astype', dtype=dtype, **kwargs)
   3190 
   3191     def convert(self, **kwargs):

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3054 
   3055             kwargs['mgr'] = self
-> 3056             applied = getattr(b, f)(**kwargs)
   3057             result_blocks = _extend_blocks(applied, result_blocks)
   3058 

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, copy, raise_on_error, values, **kwargs)
    459                **kwargs):
    460         return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 461                             values=values, **kwargs)
    462 
    463     def _astype(self, dtype, copy=False, raise_on_error=True, values=None,

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs)
    502 
    503                 # _astype_nansafe works fine with 1-d only
--> 504                 values = _astype_nansafe(values.ravel(), dtype, copy=True)
    505                 values = values.reshape(self.shape)
    506 

/usr/local/lib/python2.7/dist-packages/pandas/types/cast.pyc in _astype_nansafe(arr, dtype, copy)
    535 
    536     if copy:
--> 537         return arr.astype(dtype)
    538     return arr.view(dtype)
    539 

ValueError: could not convert string to float: adsads 016333000054

Any suggestions?
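For illustration, here is a minimal sketch of how this class of failure can arise. Dask infers dtypes from a sample at the start of the file, so a non-numeric value that first appears beyond the sample only surfaces as a conversion error at compute time. The file name and data below are hypothetical, not the reporter's:

import dask.dataframe as dd

# Build a CSV whose sampled head looks numeric but whose tail contains a string.
# Dask's dtype inference reads only the first ~256 KB of the file by default.
with open("mixed.csv", "w") as f:
    f.write("Value\n")
    f.write("1.0\n" * 100000)           # sampled region: all floats
    f.write("adsads 016333000054\n")    # beyond the sample: a string

df = dd.read_csv("mixed.csv")           # 'Value' inferred as float64
df["Value"].unique().compute()          # raises ValueError at compute time
                                        # (exact message varies by Dask version)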

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
jcrist commented, Jun 29, 2017

I assumed (maybe incorrectly) that using .astype would change the type on the fly.

The issue here is that dask coordinates many pandas dataframes. To ensure correctness, we enforce that all partitions have the same schema at all points in time. Our read_csv implementation therefore does the following:

  1. Infers dtypes from a sample at the start of the file, or uses dtypes given via the dtype kwarg
  2. Figures out where to split the file(s) into partitions
  3. Calls pd.read_csv on each partition, and enforces that the dtypes of that partition match those inferred/given in step 1. If they don’t, we error (as you noticed).

Step 3 only happens at compute time (after .compute() is called), while the rest happens at graph-build time. Subsequent operations (like unique or astype) add steps to the graph after the read_csv calls, so the dtype enforcement never sees a later call to astype.
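To make the ordering concrete, a short sketch (the file name is hypothetical; Value is the column from the traceback above). The .astype call only adds tasks after the read_csv tasks, so the enforcement inside read_csv still fails first:

import dask.dataframe as dd

df = dd.read_csv("mixed.csv")            # inference + dtype enforcement live in these tasks
fixed = df["Value"].astype("object")     # adds tasks *after* read_csv in the graph
fixed.compute()                          # still raises: enforcement runs inside read_csv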

Does that make sense?

When I convert all columns to objects during CSV import, the error goes away.

Note that for performance reasons you only want things that need to be object dtype to be object dtype. Numeric columns should still be float/int (apologies if you already knew this).
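A sketch of the targeted approach (the file name is hypothetical; Value is the mixed column from the traceback). Override the dtype only where needed and let purely numeric columns keep their inferred types:

import dask.dataframe as dd

# Heavy-handed: forces every column to str (works, but slow and memory-hungry)
df_all_str = dd.read_csv("mixed.csv", dtype=str)

# Targeted: only the mixed column becomes object; numeric columns stay float/int
df = dd.read_csv("mixed.csv", dtype={"Value": "object"})
unique_stuff = df["Value"].unique().compute()   # no dtype mismatch now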

0 reactions
jcrist commented, Aug 2, 2017

This was fixed in #2522. Closing.

