pd.read_csv vs dd.read_csv


For some reason, I’m able to read a bzip2 compressed csv using pd.read_csv but run into an error with dd.read_csv:

df = pd.read_csv("csv_input.csv.bz2", header=True)
len(df.index)
390626
df = dd.read_csv("csv_input.csv.bz2", header=True)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-20-c595dcfccf8d> in <module>()
----> 1 df = dd.read_csv("csv_input.csv.bz2", header=True)

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/dask/dataframe/csv.py in read_csv(urlpath, blocksize, chunkbytes, collection, lineterminator, compression, sample, enforce, storage_options, **kwargs)
    231     else:
    232         header = sample.split(b_lineterminator)[0] + b_lineterminator
--> 233     head = pd.read_csv(BytesIO(sample), **kwargs)
    234 
    235     df = read_csv_from_bytes(values, header, head, kwargs,

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561 
--> 562         return _read(filepath_or_buffer, kwds)
    563 
    564     parser_f.__name__ = name

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    313 
    314     # Create the parser.
--> 315     parser = TextFileReader(filepath_or_buffer, **kwds)
    316 
    317     if (nrows is not None) and (chunksize is not None):

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    643             self.options['has_index_names'] = kwds['has_index_names']
    644 
--> 645         self._make_engine(self.engine)
    646 
    647     def close(self):

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    797     def _make_engine(self, engine='c'):
    798         if engine == 'c':
--> 799             self._engine = CParserWrapper(self.f, **self.options)
    800         else:
    801             if engine == 'python':

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1211         kwds['allow_leading_cols'] = self.index_col is not False
   1212 
-> 1213         self._reader = _parser.TextReader(src, **kwds)
   1214 
   1215         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5129)()

pandas/parser.pyx in pandas.parser.TextReader._get_header (pandas/parser.c:7634)()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 10: invalid start byte
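
The traceback shows dask handing a raw byte sample to pd.read_csv(BytesIO(sample), ...). One plausible reading, not confirmed in the thread, is that the sample still held bzip2-compressed bytes. A bzip2 stream begins with 10 ASCII bytes ("BZh", a level digit, and a 6-byte block magic), so a UTF-8 decoder would pass positions 0 to 9 and first meet binary data at position 10, which matches the position in the error. The standard-library sketch below illustrates the idea with made-up CSV content:

```python
import bz2

# Made-up stand-in for the contents of csv_input.csv.bz2 (an assumption).
csv_text = "name,value\nalpha,1.5\nbeta,2.5\n"
compressed = bz2.compress(csv_text.encode("ascii"))

# The first 10 bytes of a bzip2 stream are ASCII: "BZh", the level digit,
# and the 6-byte block magic. Everything after that is binary data.
magic = compressed[:10]
print(magic)

# Decoding the compressed bytes as text is what a CSV parser would attempt
# if it were handed the raw sample; this is expected to fail.
try:
    compressed.decode("utf-8")
    print("raw bytes happened to decode")
except UnicodeDecodeError as exc:
    print("raw bytes are not valid UTF-8:", exc)

# Decompressing first recovers the original text.
recovered = bz2.decompress(compressed).decode("utf-8")
print(recovered == csv_text)
```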

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

mrocklin commented, Sep 17, 2016

Have you tried using the compression= keyword?

On Sat, Sep 17, 2016 at 4:18 PM, Shyam Saladi wrote:

Initially here: bokeh/datashader#240 https://github.com/bokeh/datashader/issues/240

The file is all standard text (nothing outside of ASCII characters), with string and float fields


mrocklin commented, Oct 12, 2016

Thanks for reporting. Closing this for now.


