pd.read_csv vs dd.read_csv

For some reason, I’m able to read a bzip2-compressed CSV using pd.read_csv, but I run into an error with dd.read_csv:

import pandas as pd
import dask.dataframe as dd

df = pd.read_csv("csv_input.csv.bz2", header=True)
len(df.index)

df = dd.read_csv("csv_input.csv.bz2", header=True)

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-20-c595dcfccf8d> in <module>()
----> 1 df = dd.read_csv("csv_input.csv.bz2", header=True)

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/dask/dataframe/csv.py in read_csv(urlpath, blocksize, chunkbytes, collection, lineterminator, compression, sample, enforce, storage_options, **kwargs)
    231     else:
    232         header = sample.split(b_lineterminator)[0] + b_lineterminator
--> 233     head = pd.read_csv(BytesIO(sample), **kwargs)
    234 
    235     df = read_csv_from_bytes(values, header, head, kwargs,

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    560                     skip_blank_lines=skip_blank_lines)
    561 
--> 562         return _read(filepath_or_buffer, kwds)
    563 
    564     parser_f.__name__ = name

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    313 
    314     # Create the parser.
--> 315     parser = TextFileReader(filepath_or_buffer, **kwds)
    316 
    317     if (nrows is not None) and (chunksize is not None):

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    643             self.options['has_index_names'] = kwds['has_index_names']
    644 
--> 645         self._make_engine(self.engine)
    646 
    647     def close(self):

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    797     def _make_engine(self, engine='c'):
    798         if engine == 'c':
--> 799             self._engine = CParserWrapper(self.f, **self.options)
    800         else:
    801             if engine == 'python':

/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1211         kwds['allow_leading_cols'] = self.index_col is not False
   1212 
-> 1213         self._reader = _parser.TextReader(src, **kwds)
   1214 
   1215         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5129)()

pandas/parser.pyx in pandas.parser.TextReader._get_header (pandas/parser.c:7634)()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 10: invalid start byte
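
One possible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: the `sample` handed to `pd.read_csv(BytesIO(sample), **kwargs)` may still have been bz2-compressed bytes. A bz2 stream opens with ten ASCII bytes (`BZh`, a level digit, and a six-byte block marker) followed by binary data, which would be consistent with a decode failure at position 10. The sketch below uses only the standard library and made-up CSV text to show that compressed bytes are not text until they are decompressed.

```python
import bz2

# Made-up CSV content standing in for the reporter's file.
csv_text = "name,value\na,1.5\nb,2.5\n"
raw = bz2.compress(csv_text.encode("utf-8"))

# The stream begins with the "BZh" magic; what follows is binary.
print(raw[:10])

# Decoding the compressed bytes as text is expected to fail.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("compressed bytes are not UTF-8:", exc)

# After decompression the same data decodes cleanly.
round_trip = bz2.decompress(raw).decode("utf-8")
print(round_trip == csv_text)
```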

Issue Analytics

Created 7 years ago
Reactions: 1
Comments: 6 (5 by maintainers)

Top GitHub Comments

mrocklin commented, Sep 17, 2016

Have you tried using the compression= keyword?

On Sat, Sep 17, 2016 at 4:18 PM, Shyam Saladi wrote:

Initially here: bokeh/datashader#240 (https://github.com/bokeh/datashader/issues/240)

The file is all standard text (nothing outside of ASCII characters), with string and float fields.

mrocklin commented, Oct 12, 2016

Thanks for reporting. Closing this for now.
