
[Feature Request] On import, allow option for number of fields to match widest row of data (filling missing with NaN)

See original GitHub issue

Given data such as the following, in CSV format:

header 1, header 2
data 1, data 2
data 3, data 4, data 5

On import, Pandas will fail with the error:

ParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3
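
For reference, the failure is easy to reproduce directly from a string (a minimal sketch; the three-line sample above is embedded via StringIO):

import pandas as pd
from io import StringIO

data = """header 1, header 2
data 1, data 2
data 3, data 4, data 5"""

# The C parser infers two columns from the header line and raises a
# ParserError ("Expected 2 fields ..., saw 3") when it reaches the wide row.
pd.read_csv(StringIO(data))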

There are many ways around this error:

  1. Loading the file in a spreadsheet program (e.g. Excel), and saving it as a new file
  2. Specifying usecols
  3. Setting error_bad_lines to False (sketched just after this list)
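
A minimal sketch of workaround 3, using the parameter name from the pandas API current at the time of the issue (newer pandas versions spell it on_bad_lines='skip'):

import pandas as pd
from io import StringIO

data = """header 1, header 2
data 1, data 2
data 3, data 4, data 5"""

# Rows with more fields than the header are skipped (with a warning),
# so the wide row is dropped -- this is the data loss mentioned below.
df = pd.read_csv(StringIO(data), error_bad_lines=False)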

But there are instances where these workarounds are impractical:

  1. When you want to preserve all of your data (usecols and error_bad_lines will cause data loss)
  2. When using a spreadsheet program is not feasible (e.g. encodings not supported by Excel, or a large number of files, or in particular the combination of the two)

Desired feature

In read_csv and read_table, add an expand_header option which, when set to True, sets the width of the DataFrame to that of the widest row. Cells left without data as a result would be blank or NaN.
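
Pending such an option, the requested behaviour can be roughly approximated on the user side with the existing API (an illustrative sketch only; read_csv_expanded and the col_i names are made up here, not part of pandas):

import csv
import pandas as pd

def read_csv_expanded(path, **kwargs):
    # Pre-scan for the widest row, then give read_csv enough column names
    # that narrower rows are padded with NaN instead of raising.
    with open(path, newline="") as f:
        width = max(len(row) for row in csv.reader(f))
    names = [f"col_{i}" for i in range(width)]
    # Note: with explicit names, the file's own header line is read back in
    # as an ordinary data row unless header=0 is also passed.
    return pd.read_csv(path, names=names, **kwargs)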

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 2
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
gfyoung commented, Aug 30, 2017

@joshjacobson : Thanks for your reply. I think I would agree with @jorisvandenbossche that your first option is the best way to move forward.

Good to know that you have encountered this issue with your datasets, but it’s important to distinguish between data issues and read_csv being too strict. In this case, I think it falls more on the former than the latter, given your description of the data sources.

Also, I understand that you believe that this problem is prevalent given that you encounter it on so many occasions, but for us as maintainers to agree with that assessment, we generally need to see a good number of users independently bring up this issue.

Option 2 is not generalizable because we don’t want to loosen our requirements on files being correctly formed. The responsibility should be on the user to make sure that their file is formatted correctly to be read into pandas.

Option 3 would likely be best integrated as another parameter, but we have too many parameters already. The same goes for Option 4.

Again, if more people bring up this issue, we can always revisit, but at this point, I would stick with option 1 as your way to go.

1 reaction
jorisvandenbossche commented, Aug 23, 2017

There is another workaround that does preserve all data: specify header names manually that are long enough:

In [90]: data = """a,b
    ...: 1,2
    ...: 1,2,3"""

In [92]: pd.read_csv(StringIO(data))
...
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

In [93]: pd.read_csv(StringIO(data), names=['a', 'b', 'c'])
Out[93]: 
   a  b    c
0  a  b  NaN
1  1  2  NaN
2  1  2  3.0
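
As a small addition to that workaround (not from the original thread): passing header=0 together with names tells pandas to discard the file’s own header line instead of reading it back in as data:

import pandas as pd
from io import StringIO

data = """a,b
1,2
1,2,3"""

# header=0 drops the original "a,b" line; the supplied names take over.
pd.read_csv(StringIO(data), names=['a', 'b', 'c'], header=0)
#    a  b    c
# 0  1  2  NaN
# 1  1  2  3.0
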
Read more comments on GitHub >

Top Results From Across the Web

Handling Missing Data in Pandas: NaN Values Explained
You have a couple of alternatives to work with missing data. You can: Drop the whole row. Fill the row-column combination with some...
Data Cleaning with Python and Pandas: Detecting Missing ...
In this post we'll walk through a number of different data cleaning tasks ... focus on probably the biggest data cleaning task, missing...
Working with missing data — pandas 1.5.2 documentation
In this section, we will discuss missing (also referred to as NA) values in pandas. Note. The choice of using NaN internally to...
Fill Missing Values (Space Time Pattern Mining)—ArcGIS Pro
ArcGIS geoprocessing tool that replaces missing (null) values with estimated values based on spatial neighbors, space-time neighbors, or time-series, ...
Handling Missing Data | Python Data Science Handbook
In this section, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in...
