Unexpected `ParserError` when loading data with `dask.dataframe.read_csv()`
What happened:
I get an unexpected ParserError when loading a CSV file via dask.dataframe.read_csv(). However, loading the file directly with pandas.read_csv() and then converting to a Dask DataFrame via dask.dataframe.from_pandas() succeeds.
What you expected to happen:
I expect dask.dataframe.read_csv() to successfully load the data if pandas.read_csv() is able to.
Minimal Complete Verifiable Example:
import dask.dataframe as dd
url = 'https://webservices.volcano.si.edu/geoserver/GVP-VOTW/ows?service=WFS&version=2.0.0&request=GetFeature&typeName=GVP-VOTW:Smithsonian_VOTW_Holocene_Volcanoes&outputFormat=csv'
df = dd.read_csv(url, blocksize=None)
Anything else we need to know?:
This bug was found while running a dask notebook tutorial in Pangeo Tutorial Gallery, which runs on Pangeo Binder. This issue was originally reported here.
The error can be found below:
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-1-7e661b6ebf9b> in <module>
      6
      7 # blocksize=None means use a single partion
----> 8 df = dd.read_csv(server+query, blocksize=None)

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    578         storage_options=storage_options,
    579         include_path_column=include_path_column,
--> 580         **kwargs,
    581     )
    582

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
    444
    445     # Use sample to infer dtypes and check for presence of include_path_column
--> 446     head = reader(BytesIO(b_sample), **kwargs)
    447     if include_path_column and (include_path_column in head.columns):
    448         raise ValueError(

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675
--> 676         return _read(filepath_or_buffer, kwds)
    677
    678     parser_f.__name__ = name

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    452
    453     try:
--> 454         data = parser.read(nrows)
    455     finally:
    456         parser.close()

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   1131     def read(self, nrows=None):
   1132         nrows = _validate_integer("nrows", nrows)
-> 1133         ret = self._engine.read(nrows)
   1134
   1135         # May alter columns / col_dict

/srv/conda/envs/notebook/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows)
   2035     def read(self, nrows=None):
   2036         try:
-> 2037             data = self._reader.read(nrows)
   2038         except StopIteration:
   2039             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: EOF inside string starting at row 172
Environment:
- Dask version: 2021.05.0 (and also 2.17.2)
- Python version: 3.7
- Operating System: Pangeo Binder (JupyterHub on the Cloud)
- Install method (conda, pip, source): conda, conda-forge
Ah, after some prodding, I get it: the "sample" terminates on a newline, but inside a quoted string. This should be fixable by setting sample=False, but I believe the server is also incorrectly reporting the size of the file, because it is being included as an HTTP attachment, and the size of the preamble is being included in the apparent size.
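The failure mode described here can be reproduced without the server at all. A minimal sketch (synthetic CSV, hypothetical truncation point) where a byte sample ends on a newline that falls inside a quoted string:

```python
import io

import pandas as pd

# Synthetic CSV whose quoted second field spans two physical lines.
data = b'name,description\nKrakatoa,"erupted\nin 1883"\nEtna,"active"\n'

# The full file parses fine: the embedded newline is part of the field.
assert len(pd.read_csv(io.BytesIO(data))) == 2

# Truncate at the newline that falls *inside* the quoted string --
# roughly what happens when read_csv's byte sample is cut there.
sample = data[:data.index(b'"erupted\n') + len(b'"erupted\n')]
try:
    pd.read_csv(io.BytesIO(sample))
except pd.errors.ParserError as exc:
    print(exc)  # Error tokenizing data. C error: EOF inside string ...
```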
So there are two issues:
blocksize=None. We indicate that splitting the file isn't OK, but try to anyway. Furthermore, fsspec's cat method might be more appropriate than open/read.

Hmm, that sounds sensible. Would that require a more recent version of pandas that uses fsspec internally? If so, do you know how long that release has been out for?
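For context on the cat suggestion: fsspec's cat fetches a whole file as bytes in a single call rather than streaming through open/read. A minimal sketch against fsspec's in-memory filesystem (the path is made up for illustration, not the real URL):

```python
import fsspec

# Use the in-memory filesystem so the sketch needs no network or disk.
fs = fsspec.filesystem("memory")
with fs.open("/demo.csv", "wb") as f:
    f.write(b'name,description\nEtna,"active"\n')

# cat() on a single path returns the file's bytes in one shot.
data = fs.cat("/demo.csv")
print(data.decode().splitlines()[0])  # name,description
```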