More consistent na_values handling in read_csv
The current handling of the na_values argument to read_csv differs strangely depending on what kind of value you pass. If you pass None, the default NA values are used. If you pass a dict mapping column names to values, those values are used for those columns, completely overriding the default NA values, while the defaults are used for columns not in the dict. If you pass some other kind of iterable, the union of the passed values and the default values is used as the NA values.
This behavior is confusing because sometimes the passed values override the defaults, but other times they merely add to them. It’s also contrary to the documentation at http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files, which says: “If you pass an empty list, or an empty list for a particular column, no values (including empty strings) will be considered NA.” But passing an empty list does not result in no values being considered NA. In fact, passing an empty list does nothing: the empty list is unioned with the default NA values, so the defaults are used anyway.
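To make the union behavior concrete, here is a small sketch against current pandas (io.StringIO stands in for a file on disk):

```python
import io
import pandas as pd

csv = "col\nfoo\nNA\nbar\n"

# A list is unioned with the defaults: both the custom "foo" and the
# built-in sentinel "NA" are parsed as NaN.
df = pd.read_csv(io.StringIO(csv), na_values=["foo"])
print(df["col"].isna().tolist())  # [True, True, False]

# An empty list therefore changes nothing: its union with the
# defaults is just the defaults, so "NA" still becomes NaN.
df2 = pd.read_csv(io.StringIO(csv), na_values=[])
print(df2["col"].isna().tolist())  # [False, True, False]
```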
Currently there is no easy way to pass a list of NA values which overrides the default for all columns. You can pass a dict, but then you have to specify the defaults per column. If you pass a list, you’re not overriding the defaults, you’re adding to them. This makes for confusing behavior when reading CSV files with string data in which strings like “na” and “nan” are valid data and should be read as their literal string values.
There should be a way to pass an all-column set of NA values that overrides the defaults. One possibility would be two arguments, something like all_na_values and more_na_values, to specify overriding and additional values, respectively. Another possibility would be to expose the default set (currently the module-level _NA_VALUES in parsers.py) and let users add to it if they want more NA values (e.g., read_csv(na_values=set(['newNA']) | pandas.default_nas)).
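Neither proposal was adopted as such; later pandas versions instead grew a keep_default_na parameter that fills this role. A sketch of overriding the defaults for all columns so that "na" and "nan" survive as literal strings:

```python
import io
import pandas as pd

# "na" and "nan" here are legitimate string data, not missing values.
csv = "code\nna\nnan\nNA\n"

# keep_default_na=False discards the built-in sentinel set, so only
# the values listed in na_values are treated as missing.
df = pd.read_csv(io.StringIO(csv), keep_default_na=False, na_values=["NA"])
print(df["code"].isna().tolist())  # [False, False, True]
```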
Issue Analytics
- Created 11 years ago
- Comments: 13 (8 by maintainers)
Top GitHub Comments
Hey there - I believe this issue is still not completely fixed. Reading in a csv with countries and using both keep_default_na=False and na_values='' still results in “Namibia”, country code “NA”, being read in as a float NaN.

I can’t get a dictionary of na_values to work properly for me, no matter what I try. Version is 0.22.0. hack.csv contains:

Two variants of my code - the one with the list does what I expect, but the dict version doesn’t:
output:
but the dict version:
is paying attention to the columns I specify, and then simply refusing to create any NaNs in those columns:
So… I’m stuck. Any suggestions? I really want column-specific NaN handling, so I need the dict. [Edit: oddly, the dict version does create NaNs in columns I didn’t specify in the dict, which also totally goes against my expectations for the combination of keep_default_na=False and an explicit value for na_values.] Btw, to be picky, the docs for keep_default_na are a bit misleading, in that they imply there should be no effect unless na_values is supplied (but in fact there is an effect even when na_values isn’t supplied).
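For reference, in recent pandas versions the dict form combined with keep_default_na=False behaves the way this commenter expected (the behavior observed in 0.22.0 evidently differed): the listed columns get only the listed sentinels, and unlisted columns get none at all. A sketch, not a reconstruction of the lost hack.csv example:

```python
import io
import pandas as pd

csv = "a,b\nx,x\nNA,NA\n"

# Column "a" gets the single custom sentinel "x"; with
# keep_default_na=False, column "b" has no NA sentinels at all.
df = pd.read_csv(io.StringIO(csv), keep_default_na=False, na_values={"a": ["x"]})
print(df["a"].isna().tolist())  # [True, False]
print(df["b"].tolist())         # ['x', 'NA']
```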