Design of DataFrame.where() is potentially counter-intuitive
Code Sample
>>> s = pd.Series(range(5))
>>> s.where(s > 1, 10)
0 10
1 10
2 2
3 3
4 4
dtype: int64
Problem description
This is a purely design-related issue!
I think that the behavior of DataFrame.where() is potentially confusing and can easily be misunderstood. The core of the issue is the use of the other parameter.
The behavior when other is set by the user can be confusing:
>>> s = pd.Series(range(5))
>>> s.where(s > 1, 10)
0 10
1 10
2 2
3 3
4 4
dtype: int64
Reading s.where(s > 1, 10), I think the logical way to read that, or to use a function called .where(), is:
Replace values in s where s > 1 is True with 10
While the proper use of other may make sense if you’re fully familiar with the default behavior of .where(), it can be confusing to those reading it for the first time.
I think this boils down to the function being,
Replace values where the condition is False
Instead of,
Replace values where the condition is True
which I think is the expected behavior based on the function name.
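To make the current semantics concrete, here is a minimal sketch comparing Series.where with numpy.where on the series from the snippet above. The variable names (cond, kept, via_numpy) are illustrative only and are not part of the original issue.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5))
cond = s > 1

# Series.where keeps the original value where cond is True
# and substitutes `other` (here 10) where cond is False.
kept = s.where(cond, 10)

# np.where(cond, x, y) picks x where cond is True and y otherwise,
# so passing the original values as x reproduces the same result.
via_numpy = pd.Series(np.where(cond, s, 10), index=s.index)

print(kept.tolist())       # [10, 10, 2, 3, 4]
print(via_numpy.tolist())  # [10, 10, 2, 3, 4]
```

Read this way, the condition describes which values survive, not which values get replaced.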
Expected Output
Changing the behavior of .where() to be:
Replace values where the condition is True
This would then give,
>>> s = pd.Series(range(5))
>>> s.where(s > 1, 10)
0 0
1 1
2 10
3 10
4 10
dtype: int64
So, to replicate the output of the initial snippet,
>>> s = pd.Series(range(5))
>>> s.where(s < 2, 10)
0 10
1 10
2 2
3 3
4 4
dtype: int64
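For comparison, the output proposed in this section can already be produced with the current behavior by negating the condition. This is only a sketch of a workaround under today's semantics, not a change to the API; the name proposed_style is made up for illustration.

```python
import pandas as pd

s = pd.Series(range(5))

# Under the current semantics, values are kept where the condition is True.
# Negating the condition with ~ therefore replaces values where s > 1 is True.
proposed_style = s.where(~(s > 1), 10)

print(proposed_style.tolist())  # [0, 1, 10, 10, 10]
```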
Output of pd.show_versions()
Not applicable
Issue Analytics
- State:
- Created 5 years ago
- Reactions:1
- Comments:5 (2 by maintainers)
@Liam3851 I embarrassingly didn’t know that’s what mask did. I always assumed it had something to do with masked arrays in numpy for some reason, and never had a need for it in pandas so never read the docs!
That mnemonic is actually really helpful, thank you for sharing! I’m just going to use mask from now on since it behaves how I expect where to and matches more closely the way I think about filtering data.
I do wonder if the existence of both mask and where is redundant. I still think that where is confusing to read. I’ve asked several colleagues how they expect it to behave, and none were able to guess correctly.
@scwilkinson There’s already DataFrame.mask that does what you suggest, so I don’t think we really need another alias in the library? You could perhaps monkey patch if you find the name confusing.
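As a rough illustration of the point about mask, the sketch below shows that mask replaces values where the condition is True, which is the mirror image of where. The small DataFrame and the variable names are assumptions chosen for the example, not taken from the thread.

```python
import pandas as pd

s = pd.Series(range(5))

# mask replaces values where the condition is True ...
masked = s.mask(s > 1, 10)
# ... which matches where applied to the negated condition.
where_inverted = s.where(~(s > 1), 10)
print(masked.tolist())  # [0, 1, 10, 10, 10]

# The same relationship holds element-wise for a DataFrame.
df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 0, 1]})
df_masked = df.mask(df > 1, 10)
df_where = df.where(df <= 1, 10)
print(df_masked.equals(df_where))  # True
```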
Personally, I usually think of DataFrame.where like the Python ternary operator, if that mnemonic helps. The ternary operator has form: which well matches