More consistent na_values handling in read_csv
The current handling of the na_values argument to read_csv differs strangely depending on what kind of value you pass. If you pass None, the default NA values are used. If you pass a dict mapping column names to values, those values are used for those columns, completely overriding the default NA values, while the defaults are used for columns not in the dict. If you pass some other kind of iterable, the union of the passed values and the default values is used as the NA values.
This behavior is confusing because sometimes the passed values override the defaults, but other times they merely add to them. It’s also contrary to the documentation at http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files, which says: “If you pass an empty list, or an empty list for a particular column, no values (including empty strings) will be considered NA.” But passing an empty list does not result in no values being considered NA. In fact, passing an empty list does nothing: the empty list is unioned with the default NA values, so the defaults are used anyway.
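To make the union behavior concrete, here is a small sketch against current pandas (io.StringIO stands in for a file on disk):

```python
import io
import pandas as pd

csv = "col\nfoo\nNA\nbar\n"

# A list is unioned with the defaults: both the custom "foo" and the
# built-in sentinel "NA" are parsed as NaN.
df = pd.read_csv(io.StringIO(csv), na_values=["foo"])
print(df["col"].isna().tolist())  # [True, True, False]

# An empty list therefore changes nothing: its union with the
# defaults is just the defaults, so "NA" still becomes NaN.
df2 = pd.read_csv(io.StringIO(csv), na_values=[])
print(df2["col"].isna().tolist())  # [False, True, False]
```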
Currently there is no easy way to pass a list of NA values which overrides the default for all columns. You can pass a dict, but then you have to specify the defaults per column. If you pass a list, you’re not overriding the defaults, you’re adding to them. This makes for confusing behavior when reading CSV files with string data in which strings like “na” and “nan” are valid data and should be read as their literal string values.
There should be a way to pass an all-column set of NA values that overrides the defaults. One possibility would be two arguments, something like all_na_values and more_na_values, to specify overriding and additional values, respectively. Another possibility would be to expose the default set (currently the module-level _NA_VALUES in parsers.py) and let users add to it if they want more NA values (e.g., read_csv(na_values=set(['newNA']) | pandas.default_nas)).
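Neither proposal was adopted as such; later pandas versions instead grew a keep_default_na parameter that fills this role. A sketch of overriding the defaults for all columns so that "na" and "nan" survive as literal strings:

```python
import io
import pandas as pd

# "na" and "nan" here are legitimate string data, not missing values.
csv = "code\nna\nnan\nNA\n"

# keep_default_na=False discards the built-in sentinel set, so only
# the values listed in na_values are treated as missing.
df = pd.read_csv(io.StringIO(csv), keep_default_na=False, na_values=["NA"])
print(df["code"].isna().tolist())  # [False, False, True]
```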
Issue Analytics
- Created 11 years ago
- Comments: 13 (8 by maintainers)
Top GitHub Comments
Hey there - I believe this issue is still not completely fixed. Reading in a csv with countries and using both keep_default_na=False and na_values='' still results in “Namibia”, country code “NA”, being read in as a float NaN.

I can’t get a dictionary of na_values to work properly for me, no matter what I try. Version is 0.22.0. hack.csv contains:

Two variants of my code - the one with the list does what I expect, but the dict version doesn’t:
output:
but the dict version:
is paying attention to the columns I specify, and then simply refusing to create any NaNs in those columns:
So… I’m stuck. Any suggestions? I really want column-specific NaN handling, so I need the dict. [Edit: oddly, the dict version does create NaNs in columns I didn’t specify in the dict, which also totally goes against my expectations for the combination of keep_default_na=False and an explicit value for na_values.] Btw, to be picky, the docs for keep_default_na are a bit misleading, in that they imply there should be no effect unless na_values is supplied (but in fact there is an effect even when na_values isn’t supplied).
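For reference, in recent pandas versions the dict form combined with keep_default_na=False behaves the way this commenter expected (the behavior observed in 0.22.0 evidently differed): the listed columns get only the listed sentinels, and unlisted columns get none at all. A sketch, not a reconstruction of the lost hack.csv example:

```python
import io
import pandas as pd

csv = "a,b\nx,x\nNA,NA\n"

# Column "a" gets the single custom sentinel "x"; with
# keep_default_na=False, column "b" has no NA sentinels at all.
df = pd.read_csv(io.StringIO(csv), keep_default_na=False, na_values={"a": ["x"]})
print(df["a"].isna().tolist())  # [True, False]
print(df["b"].tolist())         # ['x', 'NA']
```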