API: read_csv parsing of quoted "NA" value as NaN

From https://github.com/pandas-dev/pandas/issues/10647#issuecomment-123715600

In the context of the discussion on the default parsing of ‘NA’ in a csv file as NaN: I don’t think we can change the default of parsing NA, but we may consider parsing a quoted "NA" as a string instead of NaN:

Another change that we might be more likely to consider is the parsing of quoted NA values (so only “NA” would no longer be converted automatically, while unquoted NA and similar values would still be converted to NaN as they are now). Personally, I would even consider it a bug that “NA” and NA are treated the same, as I would expect quoted values to be left untouched. But I don’t know how long this behaviour has been in place.

That said, I am not sure how invasive such a change would be, and that is difficult to assess, so maybe it is not worth risking many breakages?
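
The behaviour described above can be reproduced with a small in-memory file. This is only a minimal sketch; the column names and sample values are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: column "a" holds a quoted "NA",
# column "b" holds an unquoted NA.
csv_text = 'a,b\n"NA",NA\nfoo,bar\n'

df = pd.read_csv(StringIO(csv_text))

# With default settings both cells are reported as missing, i.e. the
# quotes around "NA" do not protect it from being converted to NaN.
print(df.isna())
```

Both cells in the first data row come back as missing, which is the equivalence between “NA” and NA that the issue questions.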

Issue Analytics

Created 7 years ago
Reactions: 1
Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction

HHest commented, Nov 29, 2017

Example:

from io import StringIO
import pandas as pd

pd.read_csv(StringIO('a,b\nNA,NA\n0123,0123'), dtype={'a': str})
Out[1]: 
      a      b
0   NaN    NaN
1  0123  123.0

Would love to see column ‘a’ showing “NA, 0123” as strings.

This is a very typical problem for us, because so many of our records have “exchange code” = “NA”. Currently we need to use keep_default_na=False and roll our own na_values, which is cumbersome and error-prone.
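
For readers who have not used that workaround, it might look roughly like the sketch below. The column names, the sample rows and the choice of the empty string as the only missing-value marker are assumptions made for illustration; a real file would need its own list.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: "NA" is a real exchange code,
# while the empty price field is genuinely missing.
csv_text = "exchange,price\nNA,1.5\nNYSE,\n"

df = pd.read_csv(
    StringIO(csv_text),
    dtype={"exchange": str},
    keep_default_na=False,  # switch off the built-in markers such as "NA"
    na_values=[""],         # hand-rolled list: only empty fields are missing
)
print(df)
```

The drawback mentioned above is visible here: every marker that should still count as missing has to be listed by hand, and the list applies to all columns.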

For my situation, I find it counterproductive that a column specified as string is still vetted for NaN. On the other hand, I can imagine some users wanting this vetting for “NaN” to happen even for strings, because they want to encode “N/A” responses. Perhaps the NaN vetting process could be controlled per column, for example by allowing na_filter to take a dictionary of {col_name: boolean}.
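
A dictionary-valued na_filter is only a suggestion in this comment. Something close to per-column control can already be approximated, though, because na_values accepts a dict keyed by column name. The sketch below reuses the example data from above and assumes that only column ‘b’ should treat NA as missing.

```python
from io import StringIO

import pandas as pd

csv_text = "a,b\nNA,NA\n0123,0123\n"

df = pd.read_csv(
    StringIO(csv_text),
    dtype={"a": str},
    keep_default_na=False,    # no built-in markers for any column
    na_values={"b": ["NA"]},  # only column "b" treats NA as missing
)
print(df)
```

Column ‘a’ keeps “NA” and “0123” as strings, while column ‘b’ is still parsed as numeric with a NaN in the first row. This still requires listing the markers for every column that needs them, so it does not remove the cumbersome part of the workaround.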

0 reactions

patricktokeeffe commented, Jun 11, 2019

In the context of the discussion on the default parsing of ‘NA’ in a csv file as NaN: I don’t think we can change the default of parsing NA, but we may consider parsing a quoted “NA” as a string instead of NaN:

fwiw, this may not be possible without changing the csv parser behavior: I use read_csv to import files where the NA value is literally "NAN" (including the quotes), but I have to pass NAN without quotes to the na_values parameter because the double quotes are automatically consumed.
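
A small sketch of that situation, with made-up column names and values: the file quotes every field, yet the marker is passed to na_values without quotes.

```python
from io import StringIO

import pandas as pd

# Hypothetical logger-style file in which every field is quoted
# and missing readings are written as "NAN".
csv_text = (
    "timestamp,value\n"
    '"2019-06-11 00:00","NAN"\n'
    '"2019-06-11 00:30","1.5"\n'
)

# The parser strips the double quotes before comparing against
# na_values, so the marker is given as NAN rather than "NAN".
df = pd.read_csv(StringIO(csv_text), na_values=["NAN"])
print(df)
```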