question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

read_csv() & extra trailing comma(s) cause parsing issues.

See original GitHub issue

I have run into a few opportunities to further improve the wonderful read_csv() function. I’m using the latest x64 0.10.1 build from 10-Feb-2013 16:52.

Symptoms:

  1. With extra trailing commas and setting index_col=False read_csv() fails with: IndexError: list index out of range
  2. When one or more CSV rows has additional trailing commas (compared to previous rows) read_csv() fails with: CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6

I believe the opportunity to fix these exceptions would require simply ignoring any extra trailing commas. This is how many CSV readers work such as opening CSVs in excel. In my case I regularly work with 100K line CSVs that occasionally have extra trailing columns causing read_csv() to fail. Perhaps its possible to have an option to ignore trailing commas, or even better an option to ignore/skip any malformed rows without raising a terminal exception. 😃

-Gagi

If a CSV has ‘n’ matched extra trailing columns and you do not specify any index_col then the parser will correctly assume that the first ‘n’ columns are the index , if you set index_col=False it fails with: IndexError: list index out of range

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import StringIO

In [4]: pd.__version__
Out[4]: '0.10.1'

In [5]: data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,' <-- Matched extra commas

In [6]: data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,' <-- Miss-matched extra commas

In [7]: print data
a,b,c
4,apple,bat,,
8,orange,cow,,

In [8]: print data2
a,b,c
4,apple,bat,,
8,orange,cow,,,

In [9]: df = pd.read_csv(StringIO.StringIO(data))

In [10]: df
Out[10]:             a   b   c
4 apple   bat NaN NaN
8 orange  cow NaN NaN

In [11]: df.index
Out[11]: MultiIndex
[(4, apple), (8, orange)]

In [12]: df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-4d7ece45eef3> in <module>()
----> 1 df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    397                     buffer_lines=buffer_lines)
    398
--> 399         return _read(filepath_or_buffer, kwds)
    400
    401     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    629             #     self._engine.set_error_bad_lines(False)
    630
--> 631         ret = self._engine.read(nrows)
    632
    633         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6946)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7670)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._get_column_name (pandas\src\parser.c:10545)()

IndexError: list index out of range

In [13]: df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-13-441bfd4aff6e> in <module>()
----> 1 df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    397                     buffer_lines=buffer_lines)
    398
--> 399         return _read(filepath_or_buffer, kwds)
    400
    401     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    629             #     self._engine.set_error_bad_lines(False)
    630
--> 631         ret = self._engine.read(nrows)
    632
    633         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6734)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._tokenize_rows (pandas\src\parser.c:6619)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.raise_parser_error (pandas\src\parser.c:17023)()

CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
- show quoted text -

Issue Analytics

  • State:closed
  • Created 11 years ago
  • Comments:12 (5 by maintainers)

github_iconTop GitHub Comments

6reactions
BastinRobincommented, Jun 21, 2015

I sorted this issue with use of csv module. My code look like

import csv import pandas as pd import numpy as np

df = pd.read_csv(‘filename.csv’, parse_dates=True, dtype=Object, delimiter=“\t”, quoting=csv.QUOTE_NONE, encoding=‘utf-8’)

6reactions
davidjameshumphreyscommented, Mar 22, 2013

With the mismatched trailing commas 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,' if I call read_csv(..., error_bad_lines=False) omitting the index_col=False then it will keep processing the data but will drop the bad line.

If index_col=False is added in then it will fail with the error as described in 1 above.

I have a similar issue processing files where the last field is freeform text and the separator is sometimes included.

I have two different workarounds depending on the situation:

  1. if it is just trailing commas then I pre-process the data to remove the commas.
  2. For the case where the final field contains the separator I pre-process the file using:

perl -pne '$_ = join("|", split(/,/, $_, 4) )' < input.csv > output.csv

where 4 is the max number of fields that I want to have. This option requires a second delimiter (all valid , get changed to | and any , after the 4th is unchanged).

I hope this workaround helps.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Trailing delimiter confuses pandas read_csv - Stack Overflow
It treats the extra delimiter as if there's an extra column. So there's one more column than what headers require. Then pandas.read_csv takes ......
Read more >
How To Fix pandas.parser.CParserError: Error tokenizing data
The most obvious solution to the problem, is to fix the data file manually by removing the extra separators in the lines causing...
Read more >
No Trailing Comma, CParserError: Error tokenizing data. C error
When one or more CSV rows has additional trailing commas (compared to previous rows) read_csv() fails with: CParserError: Error tokenizing data.
Read more >
No Trailing Comma, CParserError: Error tokenizing data. C
that occasionally have extra trailing columns causing read_csv() to fail. Pe rhaps its possible to have an option to ignore trailing commas, or...
Read more >
SFTP Syncs: Troubleshooting CSV formatting errors
Saving an Excel file as a CSV file can create extra commas at the end of each row. Trailing commas can result when...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found