API: read_csv parsing of quoted "NA" value as NaN
From https://github.com/pandas-dev/pandas/issues/10647#issuecomment-123715600
In the context of the discussion on the default parsing of NA in a CSV file as NaN: I don't think we can change the default of parsing unquoted NA, but we may consider parsing quoted "NA" as a string instead of NaN:
Another change that we would maybe be more likely to consider is the parsing of quoted NA values (so only changing "NA" to not be converted automatically, but leaving NA and the like converted to NaN as they are now). Personally, I would even consider it a bug that "NA" and NA are treated the same, as I would expect quoted values to be left untouched. But I don't know how long this behaviour has been in place.
However, I am not sure how invasive such a change would be, and that is difficult to assess, so maybe it is not worth risking many breakages?
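To make the behaviour under discussion concrete, here is a minimal sketch. The sample data and column names are made up for illustration; the only claim is that read_csv, with default settings, treats a quoted "NA" and a bare NA the same way.

```python
import io

import pandas as pd

# Hypothetical sample: a quoted "NA", a bare NA, and an ordinary string.
csv_text = 'a,b\n"NA",1\nNA,2\nx,3\n'

df = pd.read_csv(io.StringIO(csv_text))

# The quotes are stripped before NA matching, so the first two rows
# of column 'a' both end up flagged as missing.
print(df["a"].isna().tolist())
```

The proposal above would leave the first row as the string NA while still converting the second row to NaN.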
Issue Analytics
- Created 7 years ago
- Reactions:1
- Comments:5 (4 by maintainers)
Top GitHub Comments
Example:
Would love to see column ‘a’ showing “NA, 0123” as strings.
This is a very typical problem for us, because we have so many records where "exchange code" = "NA". Currently we need to use keep_default_na=False and roll our own na_values, which is cumbersome and error-prone.
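For readers unfamiliar with that workaround, a rough sketch follows. The data, column names, and the choice of treating only empty fields as missing are assumptions for illustration, not the commenter's actual setup.

```python
import io

import pandas as pd

# Hypothetical data: "NA" is a real exchange code, and the empty
# field on the second row is the only genuinely missing value.
csv_text = "exchange_code,ticker\nNA,0123\nNYSE,\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype=str,              # keep leading zeros such as 0123
    keep_default_na=False,  # drop the built-in NA strings, including NA
    na_values=[""],         # hand-rolled list: only empty fields are missing
)

print(df)
```

The cumbersome part is that the na_values list has to be rebuilt by hand, and any default NA string left out of it (for example N/A or NULL) silently stops being treated as missing.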
For my situation, I find it counterproductive that a column specified as string is still vetted for NaN. On the other hand, I can imagine some users wanting this vetting to happen even for strings, because they want to encode "N/A" responses as NaN. Perhaps the NaN vetting could be controlled per column, for example by allowing na_filter to take a dictionary of {col_name: boolean}?
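A dictionary-valued na_filter is a proposal, not an existing option. The closest existing mechanism is probably passing a dict to na_values together with keep_default_na=False, so that NA matching applies only to the listed columns. A sketch with made-up data and column names:

```python
import io

import pandas as pd

# Hypothetical data: "NA" is a valid exchange code, while "N/A" in the
# response column should be treated as missing.
csv_text = "exchange_code,response\nNA,N/A\nNYSE,yes\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype=str,
    keep_default_na=False,            # no built-in NA strings anywhere
    na_values={"response": ["N/A"]},  # NA matching only for this column
)

print(df)
```

This gives per-column control over which strings count as missing, but it is still opt-in per value rather than a simple on/off switch per column as suggested above.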
fwiw, this may not be possible without changing the csv parser behavior: I use read_csv to import files where the NA value is literally "NAN", but I have to specify NAN to the na_values parameter because double-quotes are automatically consumed.
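A small sketch of that situation, using invented data: the file contains the quoted token "NAN", but the entry passed to na_values has to be the unquoted NAN, because the quotes are already gone by the time NA matching happens.

```python
import io

import pandas as pd

# Hypothetical file in which the missing-value marker is the quoted "NAN".
csv_text = 'a,b\n"NAN",1\nx,2\n'

# The na_values entry is written without quotes: the parser strips the
# double-quotes from the field before comparing it against this list.
df = pd.read_csv(io.StringIO(csv_text), na_values=["NAN"])

print(df["a"].isna().tolist())
```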