Design of DataFrame.where() is potentially counter-intuitive

Code Sample

>>> s = pd.Series(range(5))
>>> s.where(s > 1, 10)
0    10
1    10
2     2
3     3
4     4
dtype: int64

Problem description

This is a purely design-related issue!

I think that the behavior of DataFrame.where() is potentially confusing and can easily be misunderstood. The core of the issue is the use of the other parameter.

The behavior when other is set by the user can be confusing,

>>> s = pd.Series(range(5))
>>> s.where(s > 1, 10)
0    10
1    10
2     2
3     3
4     4
dtype: int64

Reading s.where(s > 1, 10), I think the logical way to read that, or to use a function called .where(), is:

Replace values in s where s > 1 is True with 10

While the proper use of other may make sense if you’re fully familiar with the default behavior of .where(), it can be confusing to those reading it for the first time.

I think this boils down to the function being,

Replace values where the condition is False

Instead of,

Replace values where the condition is True

which I think is the expected behavior based on the function name.

Expected Output

Changing the behavior of .where() to be:

Replace values where the condition is True

This would then give,

>>> s = pd.Series(range(5))
>>> s.where(s > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64

So, to replicate the output of the initial snippet,

>>> s = pd.Series(range(5))
>>> s.where(s < 2, 10)
0    10
1    10
2     2
3     3
4     4
dtype: int64
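
For comparison, under the current semantics the output proposed above can already be obtained by negating the condition. This is a minimal sketch, checked only against the five-element series used in this issue:

```python
import pandas as pd

s = pd.Series(range(5))

# Current behavior: values are kept where the condition is True,
# so negating the condition replaces the values where s > 1.
result = s.where(~(s > 1), 10)
print(result.tolist())  # [0, 1, 10, 10, 10]
```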

Output of pd.show_versions()

Not applicable

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
scwilkinson commented, Mar 5, 2019

@Liam3851 I embarrassingly didn’t know that’s what mask did. I always assumed it had something to do with masked arrays in numpy for some reason, and never had a need for it in pandas so never read the docs!

That mnemonic is actually really helpful, thank you for sharing! I’m just going to use mask from now on since it behaves how I expect where to and matches more closely the way I think about filtering data.

I do wonder if the existence of both mask and where is redundant. I still think that where is confusing to read. I’ve asked several colleagues how they expect it to behave, and none were able to guess correctly.

1 reaction
Liam3851 commented, Mar 5, 2019

@scwilkinson There’s already DataFrame.mask that does what you suggest, so I don’t think we really need another alias in the library? You could perhaps monkey patch

pd.Series.where_true = pd.Series.mask

if you find the name confusing.
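
As a rough sketch of the suggestion above, mask replaces values where the condition is True, and the alias behaves identically once patched in. The name where_true comes from the one-liner above and is not part of pandas:

```python
import pandas as pd

s = pd.Series(range(5))

# mask replaces values where the condition is True and keeps the rest
masked = s.mask(s > 1, 10)
print(masked.tolist())  # [0, 1, 10, 10, 10]

# Monkey-patched alias; where_true is a made-up name, not a pandas method
pd.Series.where_true = pd.Series.mask
aliased = s.where_true(s > 1, 10)
print(aliased.equals(masked))  # True
```

Patching a class this way affects every Series in the process, so it is probably better suited to a personal notebook than to shared code.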

Personally, I usually think of DataFrame.where like the Python ternary operator, if that mnemonic helps. The ternary operator has the form:

true_value if cond else false_value

which matches well with

df.where(cond, false_value)
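
To make the mnemonic concrete, here is a small sketch that applies the ternary element by element and compares it with where. It uses the Series from the issue rather than a DataFrame, and the variable names are only illustrative:

```python
import pandas as pd

s = pd.Series(range(5))
cond = s > 1

# Element-wise ternary: keep the value if the condition holds, else use 10
by_hand = pd.Series([v if c else 10 for v, c in zip(s, cond)])

result = s.where(cond, 10)
print(result.tolist())   # [10, 10, 2, 3, 4]
print(by_hand.tolist())  # [10, 10, 2, 3, 4]
```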