dataframe.fillna with df fails in a specific case


The following code replaces NaN values in a dataframe with values from a second dataframe, matched on key, and works as expected:

import pandas as pd
import numpy as np
df = pd.DataFrame({'key': ['01', '01', '01', '03', '04', '05'], 'A': [np.nan, 'A1', 'A2', 'A3', np.nan, np.nan], 'B': [1, 2, 3, np.nan, 5, 6]})
df2 = pd.DataFrame({'key': ['01', '03', '04', '05', '08', '99'], 'A': ['OK1', 'KO3', 'OK4', 'OK5', 'KO8', 'K99'], 'B': [91, 92, 93, 94, 95, 12]})
res = df.set_index('key').fillna(df2.set_index('key')).reset_index()

We obtain:

df
  key    A    B
0  01  NaN  1.0
1  01   A1  2.0
2  01   A2  3.0
3  03   A3  NaN
4  04  NaN  5.0
5  05  NaN  6.0

df2
  key    A   B
0  01  OK1  91
1  03  KO3  92
2  04  OK4  93
3  05  OK5  94
4  08  KO8  95
5  99  K99  12

res
  key    A     B
0  01  OK1   1.0
1  01   A1   2.0
2  01   A2   3.0
3  03   A3  92.0
4  04  OK4   5.0
5  05  OK5   6.0

However, the following minor change breaks it for no apparent reason. Computing df_res raises an InvalidIndexError:

df.at[3, 'key'] = '99'
df_res = df.set_index('key').fillna(df2.set_index('key')).reset_index()

Here is the updated dataframe.

df
  key    A    B
0  01  NaN  1.0
1  01   A1  2.0
2  01   A2  3.0
3  99   A3  NaN
4  04  NaN  5.0
5  05  NaN  6.0

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
mbataillou commented, Dec 2, 2019

Hi everyone,

The concerns raised by @andreapiso are important but don’t relate to the issue pointed out by @remidomingues.

This issue arises because the dataframe with missing values needs to have a monotonically increasing index if that index contains repeated values. See the following example:

# This works
df1 = pd.DataFrame({'col1': [np.nan, np.nan, np.nan], 'col2': [1, np.nan, 1]}, index=[0, 1, 1])
df2 = pd.DataFrame({'col1': [0, 0], 'col2': [0, 0]}, index=[1, 0])
df1.fillna(df2)

# This doesn't
df1 = pd.DataFrame({'col1': [np.nan, np.nan, np.nan], 'col2': [1, np.nan, 1]}, index=[1, 1, 0])
df2 = pd.DataFrame({'col1': [0, 0], 'col2': [0, 0]}, index=[1, 0])
df1.fillna(df2)

The second call raises:

----------------------------------------------------
InvalidIndexError  Traceback (most recent call last)
<ipython-input-118-819416bf371d> in <module>
      1 df1 = pd.DataFrame({'col1': [np.nan, np.nan, np.nan], 'col2': [1, np.nan, 1]}, index=[1, 1, 0])
      2 df2 = pd.DataFrame({'col1': [0, 0], 'col2': [0, 0]}, index=[1, 0])
----> 3 df1.fillna(df2)

~/.local/lib/python3.6/site-packages/pandas/core/frame.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   4257             limit=limit,
   4258             downcast=downcast,
-> 4259             **kwargs
   4260         )
   4261 

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6280                 )
   6281             elif isinstance(value, DataFrame) and self.ndim == 2:
-> 6282                 new_data = self.where(self.notna(), value)
   6283             else:
   6284                 raise ValueError("invalid fill value with a %s" % type(value))

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in where(self, cond, other, inplace, axis, level, errors, try_cast)
   9274         other = com.apply_if_callable(other, self)
   9275         return self._where(
-> 9276             cond, other, inplace, axis, level, errors=errors, try_cast=try_cast
   9277         )
   9278 

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast)
   9029                     other._get_axis(i).equals(ax) for i, ax in enumerate(self.axes)
   9030                 ):
-> 9031                     raise InvalidIndexError
   9032 
   9033             # slice me out of the other

InvalidIndexError: 

So a quick fix is to sort the index before calling .fillna, but it would be nice to get a more informative error message.
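As a minimal sketch of that workaround, assuming the frames from the original report (with the key in row 3 already changed to '99'), the frame being filled can be sorted by its index before the call. The kind='mergesort' argument is an addition here, used only so that rows sharing a key keep their relative order.

```python
import numpy as np
import pandas as pd

# Frames from the original report, after df.at[3, 'key'] = '99'.
df = pd.DataFrame({
    'key': ['01', '01', '01', '99', '04', '05'],
    'A': [np.nan, 'A1', 'A2', 'A3', np.nan, np.nan],
    'B': [1, 2, 3, np.nan, 5, 6],
})
df2 = pd.DataFrame({
    'key': ['01', '03', '04', '05', '08', '99'],
    'A': ['OK1', 'KO3', 'OK4', 'OK5', 'KO8', 'K99'],
    'B': [91, 92, 93, 94, 95, 12],
})

# Sort the frame with missing values so its repeated index is increasing.
# mergesort is stable, so the three '01' rows stay in their original order.
to_fill = df.set_index('key').sort_index(kind='mergesort')

# Fill from the second frame, matched on the 'key' index.
res = to_fill.fillna(df2.set_index('key')).reset_index()
print(res)
```

Note that the result comes back in sorted key order rather than in the original row order, so this only fits cases where the row order does not matter or can be restored afterwards.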

Thanks a lot guys!

0 reactions
willigott commented, Jul 8, 2020

I have a similar case for dataframes with a MultiIndex, and I think the points made by @TomAugspurger and @mbataillou do not apply here. I also asked about it here.

Working:

filler1 = pd.DataFrame({
    'key': list('ACABCADD'),
    'g': [0, 1, 2, 0, 0, 1, 0, 1],
    'prop1': list('xyzuyasj'),
    'prop2': list('mnbbbqwo')
}).set_index(['key', 'g'])

tobefilled1 = pd.DataFrame({
    'key': list('AAABBCACDF'),
    'g': [0, 1, 2, 0, 1, 0, 3, 1, 0, 0],
    'keep_me': ['stuff'] * 10,
    'prop1': [np.nan] * 10,
    'prop2': [np.nan] * 10    
}).set_index(['key', 'g'])

print(tobefilled1.fillna(filler1))

gives:

      keep_me prop1 prop2
key g                    
A   0   stuff     x     m
    1   stuff     a     q
    2   stuff     z     b
B   0   stuff     u     b
    1   stuff   NaN   NaN
C   0   stuff     y     b
A   3   stuff   NaN   NaN
C   1   stuff     y     n
D   0   stuff     s     w
F   0   stuff   NaN   NaN

So, the indexes are not monotonic and there are entries that cannot be matched; nevertheless it works fine.

A very similar case, however, fails:

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1657897'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1589778'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])

df2.fillna(df1)

raises an InvalidIndexError.

To me, both cases look identical, so I have no idea how to check whether I am working with a valid or an invalid index. Any ideas?
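One difference between the two cases can be checked directly: the filler in the failing case (df1) contains the key pair ('A', '7') twice, whereas filler1 in the working case has no repeated pairs. Whether this is the actual cause of the error is an assumption, not something confirmed in this thread. The sketch below shows how such duplicates could be detected, and how the fill could be done after dropping them. The frames are renamed filler and to_fill for readability, and keeping the first occurrence of a repeated pair is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

# The frame that supplies values (df1 in the failing example).
filler = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1657897'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas'),
}).set_index(['key1', 'key2'])

# The frame with missing values (df2 in the failing example).
to_fill = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1589778'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7,
}).set_index(['key1', 'key2'])

# Check both indexes for repeated entries before filling.
print(to_fill.index.is_unique)
print(filler.index.is_unique)
duplicated_keys = filler.index[filler.index.duplicated()].tolist()
print(duplicated_keys)

# Assumption: keeping only the first row per key pair is acceptable.
unique_filler = filler[~filler.index.duplicated(keep='first')]
filled = to_fill.fillna(unique_filler)
print(filled)
```

If the repeated rows carry different values that all matter, dropping them is not appropriate and the filler would need to be aggregated some other way first.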
