dataframe.fillna with df fails in a specific case

The following code replaces NaN values in a dataframe with values from a second dataframe, matched on key, and works as expected:

import pandas as pd
import numpy as np
df = pd.DataFrame({'key': ['01', '01', '01', '03', '04', '05'], 'A': [np.nan, 'A1', 'A2', 'A3', np.nan, np.nan], 'B': [1, 2, 3, np.nan, 5, 6]})
df2 = pd.DataFrame({'key': ['01', '03', '04', '05', '08', '99'], 'A': ['OK1', 'KO3', 'OK4', 'OK5', 'KO8', 'K99'], 'B': [91, 92, 93, 94, 95, 12]})
res = df.set_index('key').fillna(df2.set_index('key')).reset_index()

We obtain:

df
  key    A    B
0  01  NaN  1.0
1  01   A1  2.0
2  01   A2  3.0
3  03   A3  NaN
4  04  NaN  5.0
5  05  NaN  6.0

df2
  key    A   B
0  01  OK1  91
1  03  KO3  92
2  04  OK4  93
3  05  OK5  94
4  08  KO8  95
5  99  K99  12

res
  key    A     B
0  01  OK1   1.0
1  01   A1   2.0
2  01   A2   3.0
3  03   A3  92.0
4  04  OK4   5.0
5  05  OK5   6.0

However, the following minor change breaks it for no apparent reason. When computing df_res, we obtain an InvalidIndexError:

df.at[3, 'key'] = '99'
df_res = df.set_index('key').fillna(df2.set_index('key')).reset_index()

Here is the updated dataframe.

df
  key    A    B
0  01  NaN  1.0
1  01   A1  2.0
2  01   A2  3.0
3  99   A3  NaN
4  04  NaN  5.0
5  05  NaN  6.0

Issue Analytics

Created 4 years ago
Comments: 8 (4 by maintainers)

Top GitHub Comments

mbataillou commented, Dec 2, 2019

Hi everyone,

The concerns raised by @andreapiso are important but don’t relate to the issue pointed out by @remidomingues.

This issue arises because the dataframe with missing values needs a monotonically increasing index if that index contains repeated values. See the following example:

# This works
df1 = pd.DataFrame({'col1': [np.nan, np.nan, np.nan], 'col2': [1, np.nan, 1]}, index=[0, 1, 1])
df2 = pd.DataFrame({'col1': [0, 0], 'col2': [0, 0]}, index=[1, 0])
df1.fillna(df2)

# This doesn't
df1 = pd.DataFrame({'col1': [np.nan, np.nan, np.nan], 'col2': [1, np.nan, 1]}, index=[1, 1, 0])
df2 = pd.DataFrame({'col1': [0, 0], 'col2': [0, 0]}, index=[1, 0])
df1.fillna(df2)

The second call raises:

----------------------------------------------------
InvalidIndexError  Traceback (most recent call last)
<ipython-input-118-819416bf371d> in <module>
      1 df1 = pd.DataFrame({'col1': [np.nan, np.nan, np.nan], 'col2': [1, np.nan, 1]}, index=[1, 1, 0])
      2 df2 = pd.DataFrame({'col1': [0, 0], 'col2': [0, 0]}, index=[1, 0])
----> 3 df1.fillna(df2)

~/.local/lib/python3.6/site-packages/pandas/core/frame.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   4257             limit=limit,
   4258             downcast=downcast,
-> 4259             **kwargs
   4260         )
   4261 

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6280                 )
   6281             elif isinstance(value, DataFrame) and self.ndim == 2:
-> 6282                 new_data = self.where(self.notna(), value)
   6283             else:
   6284                 raise ValueError("invalid fill value with a %s" % type(value))

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in where(self, cond, other, inplace, axis, level, errors, try_cast)
   9274         other = com.apply_if_callable(other, self)
   9275         return self._where(
-> 9276             cond, other, inplace, axis, level, errors=errors, try_cast=try_cast
   9277         )
   9278 

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast)
   9029                     other._get_axis(i).equals(ax) for i, ax in enumerate(self.axes)
   9030                 ):
-> 9031                     raise InvalidIndexError
   9032 
   9033             # slice me out of the other

InvalidIndexError:

So a quick fix is to sort the index before calling .fillna, but it would be nice to get a more informative error message.
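The sorting workaround can be sketched on the data from the original report. This is only a minimal sketch, not a confirmed fix: the stable argsort and the inverse permutation used to put the rows back in their original order are additions for illustration, and behavior may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Data from the original report, after the change that triggers the error
df = pd.DataFrame({
    'key': ['01', '01', '01', '99', '04', '05'],
    'A': [np.nan, 'A1', 'A2', 'A3', np.nan, np.nan],
    'B': [1, 2, 3, np.nan, 5, 6],
})
df2 = pd.DataFrame({
    'key': ['01', '03', '04', '05', '08', '99'],
    'A': ['OK1', 'KO3', 'OK4', 'OK5', 'KO8', 'K99'],
    'B': [91, 92, 93, 94, 95, 12],
})

left = df.set_index('key')

# Stable sort positions, so rows sharing a key keep their relative order
order = np.argsort(left.index.to_numpy(), kind='stable')

# Fill on the sorted frame: its index is now monotonically increasing
filled_sorted = left.iloc[order].fillna(df2.set_index('key'))

# argsort of a permutation is its inverse; use it to restore the row order
inverse = np.argsort(order)
res = filled_sorted.iloc[inverse].reset_index()

print(res)
```

If the original row order does not matter, the last step can be dropped and the sorted result used directly.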

Thanks a lot guys!

willigott commented, Jul 8, 2020

I have a similar case for dataframes with a multi-index, and I think the points made by @TomAugspurger and @mbataillou do not apply here. I also asked about it here.

Working:

filler1 = pd.DataFrame({
    'key': list('ACABCADD'),
    'g': [0, 1, 2, 0, 0, 1, 0, 1],
    'prop1': list('xyzuyasj'),
    'prop2': list('mnbbbqwo')
}).set_index(['key', 'g'])

tobefilled1 = pd.DataFrame({
    'key': list('AAABBCACDF'),
    'g': [0, 1, 2, 0, 1, 0, 3, 1, 0, 0],
    'keep_me': ['stuff'] * 10,
    'prop1': [np.nan] * 10,
    'prop2': [np.nan] * 10    
}).set_index(['key', 'g'])

print(tobefilled1.fillna(filler1))

This prints:

      keep_me prop1 prop2
key g                    
A   0   stuff     x     m
    1   stuff     a     q
    2   stuff     z     b
B   0   stuff     u     b
    1   stuff   NaN   NaN
C   0   stuff     y     b
A   3   stuff   NaN   NaN
C   1   stuff     y     n
D   0   stuff     s     w
F   0   stuff   NaN   NaN

So, the indexes are not monotonic and there are entries that cannot be matched; nevertheless it works fine.

A very similar case, however, fails:

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1657897'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1589778'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])

df2.fillna(df1)

This raises an InvalidIndexError.

To me, both cases look identical, so I have no idea how to check whether I work with a valid or invalid index. Any ideas?
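One way to compare the two cases is to inspect the indexes for duplicated labels. The sketch below only reports what the indexes contain; whether a duplicated label in the filler is what triggers the error here is an assumption, not something confirmed in this thread. The helper name duplicated_labels is hypothetical.

```python
import pandas as pd

# Filler from the working example
filler1 = pd.DataFrame({
    'key': list('ACABCADD'),
    'g': [0, 1, 2, 0, 0, 1, 0, 1],
    'prop1': list('xyzuyasj'),
    'prop2': list('mnbbbqwo'),
}).set_index(['key', 'g'])

# Filler from the failing example
df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1657897'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas'),
}).set_index(['key1', 'key2'])


def duplicated_labels(frame):
    """Return the index labels that occur more than once in frame."""
    idx = frame.index
    return idx[idx.duplicated(keep=False)].unique().tolist()


for name, frame in [('filler1', filler1), ('df1', df1)]:
    print(name, 'unique index:', frame.index.is_unique,
          'duplicated labels:', duplicated_labels(frame))
```

On this data the working filler has a unique index, while the failing filler contains the label ('A', '7') twice. If that turns out to be the cause, deduplicating the filler first (for example with a mask built from Index.duplicated) would be one thing to try.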
