
More consistent na_values handling in read_csv

See original GitHub issue

The current handling of the na_values argument to read_csv is strangely different depending on what kind of value you pass to na_values. If you pass None, the default NA values are used. If you pass a dict mapping column names to values, then those values will be used for those columns, totally overriding the default NA values, while for columns not in the dict, the default values will be used. If you pass some other kind of iterable, it uses the union of the passed values and the default values as the NA values.
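
A minimal sketch of the three call shapes described above, using a hypothetical two-column CSV; the comments restate the reported behavior rather than assert what any particular pandas version does:

import io
import pandas as pd

text = "code,country\nNA,Namibia\nFR,France\n"

# 1. na_values=None (the default): the built-in NA strings apply, so per the
#    description above the "NA" code is parsed as NaN.
pd.read_csv(io.StringIO(text))

# 2. na_values as a dict: per the description, the listed values replace the
#    defaults for the named columns, while other columns keep the defaults.
pd.read_csv(io.StringIO(text), na_values={"code": ["missing"]})

# 3. na_values as any other iterable: the values are unioned with the
#    defaults, so "NA" is still treated as NaN in every column.
pd.read_csv(io.StringIO(text), na_values=["missing"])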

This behavior is confusing because sometimes the passed values override the defaults, but other times they just add to the defaults. It’s also contrary to the documentation at http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files, which says: “If you pass an empty list or an empty list for a particular column, no values (including empty strings) will be considered NA.” But passing an empty list doesn’t result in no values being considered NA. In fact, passing an empty list does nothing, since the empty list is unioned with the default NA values, so the default NA values are just used anyway.

Currently there is no easy way to pass a list of NA values which overrides the default for all columns. You can pass a dict, but then you have to specify the defaults per column. If you pass a list, you’re not overriding the defaults, you’re adding to them. This makes for confusing behavior when reading CSV files with string data in which strings like “na” and “nan” are valid data and should be read as their literal string values.

There should be a way to pass an all-column set of NA values that overrides the defaults. One possibility would be to have two arguments, something like all_na_values and more_na_values, to specify overriding and additional values, respectively. Another possibility would be to expose the default set (currently the module-level _NA_VALUES in parsers.py) and allow users to add to it if they want more NA values (e.g., read_csv(na_values=set(['newNA']) | pandas.default_nas)).
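
Note: the keep_default_na argument to read_csv, which the comments below exercise, is the mechanism that later pandas versions offer for this; with keep_default_na=False the built-in NA strings are dropped and an explicit na_values list acts as a full override. A minimal sketch against a recent pandas:

import io
import pandas as pd

text = "code,country\nNA,Namibia\nFR,France\n"

# With keep_default_na=False the built-in NA strings are ignored, so only the
# values listed in na_values are treated as missing; "NA" survives as a string.
df = pd.read_csv(io.StringIO(text), keep_default_na=False, na_values=[""])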

Issue Analytics

  • State: closed
  • Created 11 years ago
  • Comments: 13 (8 by maintainers)

Top GitHub Comments

3 reactions
kayvonr commented, Feb 5, 2014

Hey there - I believe this issue is still not completely fixed. Reading in a CSV with countries and using both keep_default_na=False and na_values='' still results in “Namibia”, country code “NA”, being read in as a float NaN:

In [56]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values='')

In [57]: data.ix[160:165]
Out[57]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163         NaN        Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

In [61]: data.ix[163, 'geo_country']
Out[61]: nan

In [62]: type(data.ix[163, 'geo_country'])
Out[62]: float
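
In recent pandas releases, passing keep_default_na=False together with na_values=[""] (the empty string inside a list) keeps "NA" as a literal string; whether the 2014 version in the transcript honored it is exactly what this comment disputes. A sketch with a small inline stand-in for the file, which isn't available here:

import io
import pandas as pd

# Inline stand-in for the iso_country_continent.csv rows shown above.
text = (
    "geo_country,country,continent\n"
    "NA,Namibia,AF\n"
    "NC,New Caledonia,OC\n"
)

# keep_default_na=False turns off the built-in NA strings (so the country
# code "NA" is kept as text) and na_values=[""] marks only empty fields as NA.
df = pd.read_csv(io.StringIO(text), keep_default_na=False, na_values=[""])
print(df.loc[df["country"] == "Namibia", "geo_country"])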

0 reactions
neilser commented, Jan 7, 2018

I can’t get a dictionary of na_values to work properly for me, no matter what I try. Version is 0.22.0. hack.csv contains:

113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008
729639,"qwer","",asdfkj,466.681,,252.373

Two variants of my code - the one with the list does what I expect, but the dict version doesn’t:

df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values=[214.008,'',"blah"])
df.head()

output:

     0     1       2       3        4        5        6
113125   NaN  /blaha  kjsdkj  412.166  225.874      NaN
729639  qwer     NaN  asdfkj  466.681      NaN  252.373

but the dict version:

df = pd.read_csv("hack.csv", header=None, 
                 keep_default_na=False, na_values={2:"",6:"214.008",1:"blah",0:'113125'})
df.head()

is paying attention to the columns I specify, and then simply refusing to create any NaNs in those columns:

     0     1       2       3        4        5        6
113125  blah  /blaha  kjsdkj  412.166  225.874  214.008
729639  qwer          asdfkj  466.681      NaN  252.373

So… I’m stuck. Any suggestions? I really want to have column-specific NaN handling so I need the dict. [Edit: oddly, the dict version does create NaNs in columns I didn’t specify in the dict, which also totally goes against my expectations for the combination of keep_default_na=False and an explicit value for na_values.] Btw: to be picky, the docs for keep_default_na are a bit misleading, in that they imply there should be no effect unless na_values is supplied (but in fact there is an effect even when na_values isn’t supplied).
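
A self-contained reproduction of the list-versus-dict comparison above, with hack.csv recreated inline, for anyone who wants to rerun it against a newer pandas; no particular output is asserted here:

import io
import pandas as pd

# hack.csv recreated from the comment above.
text = (
    '113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008\n'
    '729639,"qwer","",asdfkj,466.681,,252.373\n'
)

# List form: the call the commenter reports working as expected.
df_list = pd.read_csv(
    io.StringIO(text), header=None,
    keep_default_na=False, na_values=[214.008, "", "blah"],
)

# Dict form: the per-column mapping that misbehaved on 0.22.0.
df_dict = pd.read_csv(
    io.StringIO(text), header=None,
    keep_default_na=False,
    na_values={2: "", 6: "214.008", 1: "blah", 0: "113125"},
)

print(df_list)
print(df_dict)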
