BUG: DataFrame.where with category dtype

Code Sample (it is copy-pastable)

import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(2*3).reshape(2,3), columns=list('abc'))
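# Fixed mask (instead of a random one) so the outputs shown below are reproducible
mask = np.array([[False, False, True], [True, False, False]])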
df.where(mask)
# Output is correct:
#      a   b    c
# 0  NaN NaN  2.0
# 1  3.0 NaN  NaN

df.a = df.a.astype('category')
df.b = df.b.astype('category')
df.c = df.c.astype('category')
df.where(mask)
# ValueError: Wrong number of items passed 2, placement implies 1
# Expected output: the same as before, but now with dtype `category`.

df.a.where(mask[:,0])
# 0    NaN
# 1    3.0
# Name: a, dtype: float64
# should stay in dtype category

df.a.where(mask[:,0], other=None)
# 0    None
# 1    3
# Name: a, dtype: object
# Expected output: should stay in dtype category

Problem description

df.where should work with all dtypes; the documentation does not say it is limited to some of them. NaNs are already handled correctly as missing data in a pd.Series of dtype 'category', so it should be possible to assign NaNs to such a column. The same applies to converting the dtype.

While writing this report I found that doing it column-by-column works correctly, so I’ll use that as a workaround.
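
The column-by-column workaround mentioned above can be sketched as follows. This is a minimal illustration, not code from the issue: the fixed `mask` and the name `result` are assumptions made for the example, and the dtype of each resulting column depends on the pandas version (the report shows `float64` on 0.20.2).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("abc"))
# Illustrative fixed mask; any boolean array with df's shape would do.
mask = np.array([[False, False, True], [True, False, False]])

# Convert every column to category, one column at a time as in the report.
for col in df.columns:
    df[col] = df[col].astype("category")

# Apply Series.where per column with the matching mask column,
# then reassemble the results into a new DataFrame.
result = pd.DataFrame(
    {col: df[col].where(mask[:, i]) for i, col in enumerate(df.columns)}
)

print(result)
print(result.dtypes)  # inspect whether the category dtype was kept
```

Printing `result.dtypes` makes it easy to see whether a given pandas version keeps the category dtype or falls back to another one.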

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-81-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Ubuntu lsb_release -a:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
gfyoung commented, Nov 8, 2019

@ganevgv : I would try opening a PR with this test, but add print statements to confirm whether the dtype is actually changing. It might actually be a platform thing where the dtype is already int32.
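
The test being discussed is not shown in this thread, so the following is only a hypothetical sketch of the kind of dtype check the comment suggests: print the dtype before and after the operation to see whether it actually changes, or whether the platform's default integer dtype (for example `int32` on some platforms) was already in use. All variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(3))
# The default integer dtype is platform dependent (int64 or int32).
print("integer dtype before:", s.dtype)

cat = s.astype("category")
print("dtype after astype:", cat.dtype)
print("categories dtype:", cat.cat.categories.dtype)

# Mask out the middle element and check the dtype again.
cond = np.array([True, False, True])
out = cat.where(cond)
print("dtype after where:", out.dtype)
```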

1 reaction
jreback commented, Jul 16, 2017

Can you open a separate issue about the astype behavior (and remove it from the top of this one)?
