
Unhelpful error message in `read_csv` when an object column is mistakenly inferred as numeric

See original GitHub issue

I’m not sure how to begin debugging this, and I can’t share the data directly. Could someone point me in the right direction?

I’m running Dask 0.15.0 in Python 2.7.6.

When I import this CSV in Pandas, it works correctly. If I do the same import with Dask and then try to do anything with the result (e.g. .head(10), .unique(), etc.), I immediately get an error. Whether read_csv infers the types or I set all columns to strings, I get the same error:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-aa54a2f8d1de> in <module>()
----> 1 unique_stuff = my_df['Value'].unique().compute()

/usr/local/lib/python2.7/dist-packages/dask/base.pyc in compute(self, **kwargs)
     95             Extra keywords to forward to the scheduler ``get`` function.
     96         """
---> 97         (result,) = compute(self, traverse=False, **kwargs)
     98         return result
     99 

/usr/local/lib/python2.7/dist-packages/dask/base.pyc in compute(*args, **kwargs)
    202     dsk = collections_to_dsk(variables, optimize_graph, **kwargs)
    203     keys = [var._keys() for var in variables]
--> 204     results = get(dsk, keys, **kwargs)
    205 
    206     results_iter = iter(results)

/usr/local/lib/python2.7/dist-packages/dask/threaded.pyc in get(dsk, result, cache, num_workers, **kwargs)
     73     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     74                         cache=cache, get_id=_thread_get_id,
---> 75                         pack_exception=pack_exception, **kwargs)
     76 
     77     # Cleanup pools associated to dead threads

/usr/local/lib/python2.7/dist-packages/dask/local.pyc in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    519                         _execute_task(task, data)  # Re-execute locally
    520                     else:
--> 521                         raise_exception(exc, tb)
    522                 res, worker_id = loads(res_info)
    523                 state['cache'][key] = res

/usr/local/lib/python2.7/dist-packages/dask/local.pyc in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    288     try:
    289         task, data = loads(task_info)
--> 290         result = _execute_task(task, data)
    291         id = get_id()
    292         result = dumps((result, id))

/usr/local/lib/python2.7/dist-packages/dask/local.pyc in _execute_task(arg, cache, dsk)
    269         func, args = arg[0], arg[1:]
    270         args2 = [_execute_task(a, cache) for a in args]
--> 271         return func(*args2)
    272     elif not ishashable(arg):
    273         return arg

/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.pyc in pandas_read_text(reader, b, header, kwargs, dtypes, columns, write_header, enforce)
     59     df = reader(bio, **kwargs)
     60     if dtypes:
---> 61         coerce_dtypes(df, dtypes)
     62 
     63     if enforce and columns and (list(df.columns) != list(columns)):

/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.pyc in coerce_dtypes(df, dtypes)
    101                 raise ValueError(msg % (missing_list, missing_dict))
    102 
--> 103             df[c] = df[c].astype(dtypes[c])
    104 
    105 

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in astype(self, dtype, copy, raise_on_error, **kwargs)
   3052         # else, only a single dtype is given
   3053         new_data = self._data.astype(dtype=dtype, copy=copy,
-> 3054                                      raise_on_error=raise_on_error, **kwargs)
   3055         return self._constructor(new_data).__finalize__(self)
   3056 

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, **kwargs)
   3187 
   3188     def astype(self, dtype, **kwargs):
-> 3189         return self.apply('astype', dtype=dtype, **kwargs)
   3190 
   3191     def convert(self, **kwargs):

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3054 
   3055             kwargs['mgr'] = self
-> 3056             applied = getattr(b, f)(**kwargs)
   3057             result_blocks = _extend_blocks(applied, result_blocks)
   3058 

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in astype(self, dtype, copy, raise_on_error, values, **kwargs)
    459                **kwargs):
    460         return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 461                             values=values, **kwargs)
    462 
    463     def _astype(self, dtype, copy=False, raise_on_error=True, values=None,

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs)
    502 
    503                 # _astype_nansafe works fine with 1-d only
--> 504                 values = _astype_nansafe(values.ravel(), dtype, copy=True)
    505                 values = values.reshape(self.shape)
    506 

/usr/local/lib/python2.7/dist-packages/pandas/types/cast.pyc in _astype_nansafe(arr, dtype, copy)
    535 
    536     if copy:
--> 537         return arr.astype(dtype)
    538     return arr.view(dtype)
    539 

ValueError: could not convert string to float: adsads 016333000054

Any suggestions?
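For illustration, here is a minimal sketch of how this class of failure can arise. Dask infers dtypes from a sample at the start of the file, so a non-numeric value that first appears beyond the sample only surfaces as a conversion error at compute time. The file name and data below are hypothetical, not the reporter's:

import dask.dataframe as dd

# Build a CSV whose sampled head looks numeric but whose tail contains a string.
# Dask's dtype inference reads only the first ~256 KB of the file by default.
with open("mixed.csv", "w") as f:
    f.write("Value\n")
    f.write("1.0\n" * 100000)           # sampled region: all floats
    f.write("adsads 016333000054\n")    # beyond the sample: a string

df = dd.read_csv("mixed.csv")           # 'Value' inferred as float64
df["Value"].unique().compute()          # raises ValueError at compute time
                                        # (exact message varies by Dask version)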

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
jcrist commented, Jun 29, 2017

I assumed (maybe incorrectly) that using .astype would change the type on the fly.

The issue here is that dask coordinates many pandas dataframes. To ensure correctness, we enforce that all partitions have the same schema at all points in time. Our read_csv implementation therefore does the following:

  1. Infers dtypes from a sample at the start of the file, or uses dtypes given via the dtype kwarg
  2. Figures out where to split the file(s) into partitions
  3. Calls pd.read_csv on each partition, and enforces that the dtypes of that partition match those inferred/given in step 1. If they don’t, we error (as you noticed).

Step 3 only happens at compute time (after .compute() is called), while the rest happens at graph-build time. Subsequent operations (like unique or astype) add steps to the graph after the read_csv calls, so the dtype enforcement never sees a later call to astype.
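To make the ordering concrete, a short sketch (the file name is hypothetical; Value is the column from the traceback above). The .astype call only adds tasks after the read_csv tasks, so the enforcement inside read_csv still fails first:

import dask.dataframe as dd

df = dd.read_csv("mixed.csv")            # inference + dtype enforcement live in these tasks
fixed = df["Value"].astype("object")     # adds tasks *after* read_csv in the graph
fixed.compute()                          # still raises: enforcement runs inside read_csv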

Does that make sense?

When I convert all columns to objects during CSV import, the error goes away.

Note that for performance reasons you only want things that need to be object dtype to be object dtype. Numeric columns should still be float/int (apologies if you already knew this).
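A sketch of the targeted approach (the file name is hypothetical; Value is the mixed column from the traceback). Override the dtype only where needed and let purely numeric columns keep their inferred types:

import dask.dataframe as dd

# Heavy-handed: forces every column to str (works, but slow and memory-hungry)
df_all_str = dd.read_csv("mixed.csv", dtype=str)

# Targeted: only the mixed column becomes object; numeric columns stay float/int
df = dd.read_csv("mixed.csv", dtype={"Value": "object"})
unique_stuff = df["Value"].unique().compute()   # no dtype mismatch now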

0 reactions
jcrist commented, Aug 2, 2017

This was fixed in #2522. Closing.

