read_csv C-engine CParserError: Error tokenizing data

See original GitHub issue

Hi,

I have encountered a dataset that the C-engine read_csv cannot parse. I am unsure of the exact cause, but I have narrowed it down to a single row, which I have pickled and uploaded to Dropbox. If you obtain the pickle, try the following:

import pandas as pd

df = pd.read_pickle('faulty_row.pkl')
df.to_csv('faulty_row.csv', encoding='utf8', index=False)
# read_csv is a module-level function, not a DataFrame method
pd.read_csv('faulty_row.csv', encoding='utf8')

I get the following exception:

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

If you try and read the CSV using the python engine then no exception is thrown:

pd.read_csv('faulty_row.csv', encoding='utf8', engine='python')

This suggests that the issue is with read_csv and not to_csv. The versions I am using are:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
IPython: 3.2.1
patsy: 0.3.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
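
Without the pickle, a comparable row can be hunted down directly in the DataFrame before it is written out. The sketch below assumes the culprit is a stray carriage return inside a text field, which is the diagnosis given in the comments; the sample frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame standing in for the real dataset.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["plain", "has a\rcarriage return", "also plain"],
})

# Cast every column to str and flag cells that contain a bare '\r'.
has_cr = df.apply(lambda col: col.astype(str).str.contains("\r", regex=False))

# Keep only the rows where at least one cell was flagged.
suspect_rows = df[has_cr.any(axis=1)]
print(suspect_rows)
```

Casting to str keeps the check simple for mixed column types; on a large frame it may be cheaper to restrict the scan to the text columns.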

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Reactions: 21
  • Comments: 19 (3 by maintainers)

Top GitHub Comments

76 reactions
justinjdickow commented, Jan 10, 2018

I missed @alfonsomhc answer because it just looked like a comment.

You need

df = pd.read_csv('test.csv', engine='python')
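
A minimal sketch of that advice as a fallback: try the default C engine first and switch to the Python engine only when tokenizing fails. The helper name `read_csv_with_fallback` is hypothetical, and catching `pd.errors.ParserError` assumes a reasonably recent pandas (older releases such as the 0.16.2 in the report raised `CParserError` instead). The sample file is made up.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Hypothetical helper: read with the C engine, retry with the Python engine."""
    try:
        return pd.read_csv(path, engine="c", **kwargs)
    except pd.errors.ParserError:
        # The Python engine is slower but more tolerant of unusual input.
        return pd.read_csv(path, engine="python", **kwargs)


# Small demonstration on a well-formed file, so the C engine succeeds.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    with open(path, "w", encoding="utf-8", newline="") as fh:
        fh.write("a,b\n1,x\n2,y\n")
    frame = read_csv_with_fallback(path, encoding="utf-8")
```

Keeping the C engine as the first attempt preserves its speed for files that parse cleanly. Keyword arguments that only one engine supports would need separate handling.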
43 reactions
chris-b1 commented, Sep 23, 2015

Your second-to-last line includes an '\r' break. I think it’s a bug, but one workaround is to open in universal-new-line mode.

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='c')
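
The 'rU' mode belongs to Python 2; it was deprecated in Python 3 and removed in Python 3.11. On Python 3, text mode already translates newlines when `newline=None` (the default for `open`), so a rough equivalent of the workaround might look like the sketch below. The file contents are made up for illustration. Note that the translation turns the embedded '\r' into '\n' rather than preserving it.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample: a quoted field containing a bare carriage return.
raw = b'a,b\n1,"x\ry"\n2,z\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    with open(path, "wb") as fh:
        fh.write(raw)

    # newline=None enables universal newlines: '\r', '\n' and '\r\n'
    # are all handed to the parser as '\n'.
    with open(path, "r", encoding="utf-8", newline=None) as fh:
        df = pd.read_csv(fh, engine="c")
```

Because the handle is opened in text mode, the encoding is given to `open` rather than to `read_csv`.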

