pd.read_csv vs dd.read_csv
For some reason, I'm able to read a bzip2-compressed CSV using pd.read_csv but run into an error with dd.read_csv:
df = pd.read_csv("csv_input.csv.bz2", header=True)
len(df.index)
390626
df = dd.read_csv("csv_input.csv.bz2", header=True)
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-20-c595dcfccf8d> in <module>()
----> 1 df = dd.read_csv("csv_input.csv.bz2", header=True)
/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/dask/dataframe/csv.py in read_csv(urlpath, blocksize, chunkbytes, collection, lineterminator, compression, sample, enforce, storage_options, **kwargs)
231 else:
232 header = sample.split(b_lineterminator)[0] + b_lineterminator
--> 233 head = pd.read_csv(BytesIO(sample), **kwargs)
234
235 df = read_csv_from_bytes(values, header, head, kwargs,
/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
560 skip_blank_lines=skip_blank_lines)
561
--> 562 return _read(filepath_or_buffer, kwds)
563
564 parser_f.__name__ = name
/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
313
314 # Create the parser.
--> 315 parser = TextFileReader(filepath_or_buffer, **kwds)
316
317 if (nrows is not None) and (chunksize is not None):
/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
643 self.options['has_index_names'] = kwds['has_index_names']
644
--> 645 self._make_engine(self.engine)
646
647 def close(self):
/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
797 def _make_engine(self, engine='c'):
798 if engine == 'c':
--> 799 self._engine = CParserWrapper(self.f, **self.options)
800 else:
801 if engine == 'python':
/ul/saladi/anaconda3/envs/mdtraj_py3/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1211 kwds['allow_leading_cols'] = self.index_col is not False
1212
-> 1213 self._reader = _parser.TextReader(src, **kwds)
1214
1215 # XXX
pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5129)()
pandas/parser.pyx in pandas.parser.TextReader._get_header (pandas/parser.c:7634)()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 10: invalid start byte
Issue Analytics
- State:
- Created 7 years ago
- Reactions:1
- Comments:6 (5 by maintainers)
Top GitHub Comments
Have you tried using the compression= keyword?
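One plausible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: without a compression hint, the sampled bytes handed to pd.read_csv(BytesIO(sample)) were still bzip2-compressed, so the parser tried to decode compressed data as UTF-8. A bzip2 stream begins with the ASCII bytes BZh, a level digit, and a six-byte block marker, so the first non-ASCII byte can appear at offset 10, which matches the position reported in the error. The sketch below uses only the standard library (not dask or pandas), and the sample data and variable names are made up for illustration.

```python
import bz2
import csv
import io

# Hypothetical stand-in for the contents of csv_input.csv.bz2.
csv_text = "a,b\n1,2\n3,4\n"
compressed = bz2.compress(csv_text.encode("utf-8"))

# The first ten bytes of a non-empty bzip2 stream are plain ASCII:
# "BZh" + compression level digit + the block marker "1AY&SY".
stream_header = compressed[:4]
block_marker = compressed[4:10]

# Everything from offset 10 onward is binary (a checksum, then packed bits),
# so decoding the raw stream as UTF-8 text is not expected to work.
try:
    compressed.decode("utf-8")
    raw_is_text = True
except UnicodeDecodeError:
    raw_is_text = False

# Decompressing first yields ordinary text that a CSV parser can read.
decoded = bz2.decompress(compressed).decode("utf-8")
rows = list(csv.reader(io.StringIO(decoded)))

print(stream_header, block_marker, raw_is_text)
print(rows)
```

In dask, the decompression step would be requested through the compression= keyword mentioned above, for example compression="bz2"; whether that resolves this particular report is not confirmed in the thread.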
On Sat, Sep 17, 2016 at 4:18 PM, Shyam Saladi notifications@github.com wrote:
Thanks for reporting. Closing this for now.