
Unexpected `ParserError` when loading data with `dask.dataframe.read_csv()`


What happened:

I get an unexpected ParserError when loading a CSV file via dask.dataframe.read_csv(). However, loading the file directly with pandas.read_csv() and then converting to a Dask DataFrame via dask.dataframe.from_pandas() succeeds.
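
For reference, a minimal sketch of that workaround (the single-partition npartitions=1 choice is illustrative, not from the original report):

import pandas as pd
import dask.dataframe as dd

url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'

# pandas parses the full response without error...
pdf = pd.read_csv(url)
# ...so wrap the resulting DataFrame in a single-partition Dask DataFrame
df = dd.from_pandas(pdf, npartitions=1)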

What you expected to happen:

I expect dask.dataframe.read_csv() to successfully load the data if pandas.read_csv() is able to.

Minimal Complete Verifiable Example:

import dask.dataframe as dd

# WFS query that returns the Holocene volcano list as a CSV attachment
url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
# blocksize=None means the whole file becomes a single partition
df = dd.read_csv(url, blocksize=None)

Anything else we need to know?:

This bug was found while running a Dask notebook tutorial in the Pangeo Tutorial Gallery, which runs on Pangeo Binder. This issue was originally reported here.

The error can be found below:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-1-7e661b6ebf9b> in <module>
      6
      7 # blocksize=None means use a single partion
----> 8 df = dd.read_csv(server+query, blocksize=None)

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    578             storage_options=storage_options,
    579             include_path_column=include_path_column,
--> 580             **kwargs,
    581         )
    582

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    444
    445     # Use sample to infer dtypes and check for presence of include_path_column
--> 446     head = reader(BytesIO(b_sample), **kwargs)
    447     if include_path_column and (include_path_column in head.columns):
    448         raise ValueError(

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675
--> 676     return _read(filepath_or_buffer, kwds)
    677
    678 parser_f.__name__ = name

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    452
    453     try:
--> 454         data = parser.read(nrows)
    455     finally:
    456         parser.close()

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1131     def read(self, nrows=None):
   1132         nrows = _validate_integer("nrows", nrows)
-> 1133         ret = self._engine.read(nrows)
   1134
   1135         # May alter columns / col_dict

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   2035     def read(self, nrows=None):
   2036         try:
-> 2037             data = self._reader.read(nrows)
   2038         except StopIteration:
   2039             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: EOF inside string starting at row 172
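
The error itself is easy to reproduce in pandas alone: the C parser raises it whenever the input ends while a quoted field is still open, which is exactly what happens when a byte sample is cut off mid-string. A minimal illustration with made-up data (not from the issue):

import io
import pandas as pd

# The opening quote on the second field is never closed, mimicking a
# sample that was truncated inside a quoted string.
truncated = 'Volcano,Notes\nKrakatau,"major eruption\nin 1883'

pd.read_csv(io.StringIO(truncated))
# raises pandas.errors.ParserError:
# Error tokenizing data. C error: EOF inside string ...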

Environment:

  • Dask version: 2021.05.0 (and also 2.17.2)
  • Python version: 3.7
  • Operating System: Pangeo Binder (JupyterHub on the Cloud)
  • Install method (conda, pip, source): conda, conda-forge


Top GitHub Comments

martindurant commented on May 20, 2021

Ah, after some prodding, I get it: the “sample” terminates on a newline, but that newline falls inside a quoted string. This should be fixable by setting sample=False, but I believe the server is also incorrectly reporting the size of the file, because it is being served as an HTTP attachment:

response.headers['Content-Disposition'] == 'attachment; filename=Smithsonian_VOTW_Holocene_Volcanoes.csv'

and the size of the preamble is being included in the apparent size.
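
A sketch of that suggested fix, for anyone who lands here (hedged: whether sample=False is honoured may depend on the dask version installed):

import dask.dataframe as dd

# The same WFS query URL as in the reproducer above
url = ('https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows'
       '?service=WFS&version=2.0.0&request=GetFeature'
       '&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv')

# Skipping the byte sample means it can no longer be cut off inside a
# quoted string
df = dd.read_csv(url, blocksize=None, sample=False)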

So there are two issues:

  • sample is doing the wrong thing in dd.read_csv when blocksize=None. We indicate that splitting the file isn’t OK, but try to anyway. Furthermore, fsspec’s cat method might be more appropriate than open/read (see the sketch after this list).
  • fsspec’s HTTPFileSystem apparently doesn’t handle this “attachment” case (i.e., multi-part HTTP), which I haven’t seen before in this context.
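
To illustrate the cat idea (an editor's sketch, not code from the comment): fetch the complete payload in a single call and hand the bytes straight to pandas, so no sampling or byte-range splitting is involved.

import io
import fsspec
import pandas as pd

url = ('https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows'
       '?service=WFS&version=2.0.0&request=GetFeature'
       '&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv')

fs = fsspec.filesystem('http')
data = fs.cat(url)               # one GET; returns the full body as bytes
df = pd.read_csv(io.BytesIO(data))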
jrbourbeau commented on Jun 21, 2021

Hmm, that sounds sensible. Would that require a more recent version of pandas that uses fsspec internally? If so, do you know how long that release has been out for?
