[Feature Request] On import, allow option for number of fields to match widest row of data (filling missing with NaN)
Given data such as the following, in CSV format:
```
header 1, header 2
data 1, data 2
data 3, data 4, data 5
```
On import, Pandas will fail with the error:
ParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3
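A minimal reproduction might look like the following (using io.StringIO in place of an on-disk file, which is a substitution made for illustration):

```python
import io
import pandas as pd

data = "header 1, header 2\ndata 1, data 2\ndata 3, data 4, data 5\n"

# The C parser infers 2 fields from the header row, so the row with
# 3 fields raises pandas.errors.ParserError.
df = pd.read_csv(io.StringIO(data))
```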
There are many ways around this error:
- Loading the file in a spreadsheet program (e.g. Excel) and saving it as a new file
- Specifying usecols
- Setting error_bad_lines to False (as sketched below)
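For example, the error_bad_lines route would look roughly like this (again using io.StringIO for illustration; newer pandas versions replace error_bad_lines with on_bad_lines="skip"):

```python
import io
import pandas as pd

data = "header 1, header 2\ndata 1, data 2\ndata 3, data 4, data 5\n"

# Rows with too many fields are skipped (with a warning) instead of
# raising, so "data 3, data 4, data 5" is dropped and its data is lost.
df = pd.read_csv(io.StringIO(data), error_bad_lines=False)
```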
But there are instances where these workarounds are impractical:
- When you want to preserve all of your data (usecols and error_bad_lines both cause data loss)
- When using a spreadsheet program is not feasible (e.g. encodings not supported by Excel, a large number of files, or especially the combination of the two)
Desired feature
In read_csv and read_table, add an expand_header option which, when set to True, sets the width of the DataFrame to that of the widest row. Cells left without data as a result would be blank or NaN.
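Until such an option exists, the requested behaviour can be approximated by pre-scanning the file for its widest row and then passing that many column names to read_csv. A rough sketch follows; the two-pass approach and the generated column names are illustrative, not part of the proposal:

```python
import csv
import io
import pandas as pd

raw = "header 1, header 2\ndata 1, data 2\ndata 3, data 4, data 5\n"

# First pass: find the number of fields in the widest row.
width = max(len(row) for row in csv.reader(io.StringIO(raw)))

# Second pass: supply enough column names; header=0 discards the
# original header row, and shorter rows are padded with NaN.
df = pd.read_csv(
    io.StringIO(raw),
    names=[f"col{i}" for i in range(width)],
    header=0,
    skipinitialspace=True,
)
```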
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 2
- Comments: 8 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@joshjacobson: Thanks for your reply. I think I would agree with @jorisvandenbossche that your first option is the best way to move forward.

Good to know that you have encountered this issue with your datasets, but it's important to separate data issues from read_csv being too strict. In this case, I think it falls more on the former than the latter, given your description of the data sources. Also, I understand that you believe this problem is prevalent given that you encounter it on so many occasions, but for us as maintainers to agree with that assessment, we generally need to see a good number of users independently bring up this issue.

Option 2 is not generalizable because we don't want to loosen our requirements on files being correctly formed; the responsibility should be on the user to make sure that their file is formatted correctly to be read into pandas. Option 3 would likely be best integrated as another parameter, but we have too many parameters already. The same goes for Option 4.
Again, if more people bring up this issue, we can always revisit, but at this point, I would stick with option 1 as your way to go.
There is another workaround that does preserve all data: specify header names manually that are long enough.
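A minimal sketch of that idea, with io.StringIO and the column names chosen purely for illustration:

```python
import io
import pandas as pd

data = "header 1, header 2\ndata 1, data 2\ndata 3, data 4, data 5\n"

# Passing more names than the header declares keeps every field; rows
# that are shorter than the widest one are padded with NaN.
df = pd.read_csv(io.StringIO(data), names=["a", "b", "c"], header=0,
                 skipinitialspace=True)
```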