BUG: DataFrame.where with category dtype

Code Sample (it is copy-pastable)

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(2*3).reshape(2,3), columns=list('abc'))
mask = np.random.rand(*df.shape) < 0.5
df.where(mask)
# Output is correct:
#      a   b    c
# 0  NaN NaN  2.0
# 1  3.0 NaN  NaN

df.a = df.a.astype('category')
df.b = df.b.astype('category')
df.c = df.c.astype('category')
df.where(mask)
# ValueError: Wrong number of items passed 2, placement implies 1
# Expected output: the same as before, but now with dtype `category`.

df.a.where(mask[:,0])
# 0    NaN
# 1    3.0
# Name: a, dtype: float64
# should stay in dtype category

df.a.where(mask[:,0], other=None)
# 0    None
# 1    3
# Name: a, dtype: object
# Expected output: should stay in dtype category

Problem description

`df.where` should work with all dtypes; the documentation does not say it is limited to some of them. NaNs are already handled correctly as missing data in a `pd.Series` of dtype `category`, so it should be possible to assign NaNs to such a Series. The same applies to converting the dtype.
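
The claim that categorical data already treats NaN as missing can be checked with a minimal sketch like the one below. The Series `s` and its values are illustrative and not taken from the report, and the behavior is assumed to match reasonably recent pandas releases.

```python
import numpy as np
import pandas as pd

# Illustrative categorical Series (not from the original report).
s = pd.Series([0, 3], dtype="category")

# NaN is not one of the categories, but it can be assigned as a missing value.
s.iloc[0] = np.nan

print(s.dtype)                 # the dtype is expected to remain categorical
print(list(s.cat.categories))  # the categories are not changed by the assignment
print(s.isna().tolist())       # which positions are now missing
```

If this holds, there is no obvious reason for `where` to leave the categorical dtype just to insert NaN.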

While writing this report I found that doing it column-by-column works correctly, so I’ll use that as a workaround.
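
The column-by-column workaround could look roughly like the sketch below. It is not the reporter's actual code: the fixed `mask` and the name `result` are assumptions chosen to make the run reproducible, and whether each column keeps dtype `category` depends on the pandas version (the report shows 0.20.2 returning float64 for a single column).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("abc"))
for col in df.columns:
    df[col] = df[col].astype("category")

# Fixed mask (an assumption) instead of a random one, so the result is reproducible.
mask = np.array([[False, False, True],
                 [True, False, False]])

# Apply Series.where to each column with the matching column of the mask,
# then reassemble the pieces into a DataFrame.
result = pd.DataFrame(
    {col: df[col].where(mask[:, i]) for i, col in enumerate(df.columns)}
)

print(result)
print(result.dtypes)  # dtype per column may differ between pandas versions
```

Entries where `mask` is False become NaN, the same positions `df.where(mask)` would blank out on a non-categorical frame.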

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-81-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Ubuntu `lsb_release -a`:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial

Issue Analytics

Created 6 years ago
Comments: 7 (7 by maintainers)

Top GitHub Comments

gfyoung commented, Nov 8, 2019

@ganevgv : I would try opening a PR with this test, but add print statements to confirm whether the dtype is actually changing. It might actually be a platform thing where the dtype is already int32.

jreback commented, Jul 16, 2017

Can you open a separate issue about the astype (and remove it from the top post here)?
