
pd.concat doesn't preserve Categorical dtype when the categorical column is missing in one of the DataFrames.


import pandas as pd

a = pd.DataFrame({'f1': [1,2,3]})
b = pd.DataFrame({'f1': [2,3,1], 'f2': pd.Series([4,4,4]).astype('category')})

pd.concat((a,b), sort=True).dtypes
>> f1     int64
>> f2    object
>> dtype: object

Problem description

(Similar to #14016; I’m not sure whether it’s caused by the same bug or a different one, so feel free to merge.) When concatenating two DataFrames where one has a categorical column that the other lacks, the resulting column has dtype ‘object’, losing the original categorical dtype.

If we were to fill the missing column with Nones (but with the same categorical dtype), the concatenation would keep the dtype. In the previous example, adding:

a['f2'] = pd.Series([None, None, None]).astype(b.dtypes['f2'])

before concatenating, will solve the problem.
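The manual fix above can be generalized. What follows is only a minimal sketch, not part of the original report: `align_categoricals` is a hypothetical helper name, and it assumes that the first categorical dtype seen for a column is the one every frame should share.

```python
import pandas as pd


def align_categoricals(frames):
    """Hypothetical helper: give every frame the categorical columns it lacks.

    Missing columns are added as all-missing columns with the same
    categorical dtype, mirroring the manual workaround above.
    """
    # Remember the first categorical dtype seen for each column name.
    cat_dtypes = {}
    for df in frames:
        for col, dtype in df.dtypes.items():
            if isinstance(dtype, pd.api.types.CategoricalDtype):
                cat_dtypes.setdefault(col, dtype)

    aligned = []
    for df in frames:
        df = df.copy()  # leave the caller's frames untouched
        for col, dtype in cat_dtypes.items():
            if col not in df.columns:
                df[col] = pd.Series([None] * len(df), index=df.index, dtype=dtype)
        aligned.append(df)
    return aligned


a = pd.DataFrame({'f1': [1, 2, 3]})
b = pd.DataFrame({'f1': [2, 3, 1], 'f2': pd.Series([4, 4, 4]).astype('category')})

result = pd.concat(align_categoricals([a, b]), sort=True)
print(result.dtypes)
```

Because both frames now carry 'f2' with an identical categorical dtype, the concatenation has no reason to fall back to object.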

I believe that if a column is missing from one of the concatenated DataFrames, a reasonable behavior would be to copy it over and preserve its dtype.

Expected Output

Column ‘f2’ should be a categorical (same as b[‘f2’]).

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

State:
Created 5 years ago
Reactions: 7
Comments: 7 (3 by maintainers)

Top GitHub Comments

3 reactions

yeyeric commented, Feb 23, 2022

Hello,

Since append is deprecated, I’ve migrated all my df.append(temp) calls to df = pd.concat([df, temp]).

Usually, I have processing where I do something like:

out = pd.DataFrame()
for _, temp in df.groupby('key'):
    # SOME PROCESSING OF DATA
    out = pd.concat([out, temp]) # before: out = out.append(temp)

Here, since out is an empty DataFrame at first, the result does not keep the dtypes of the temp DataFrame. For instance, if I have a datetime column, it’s converted to object.

Is that expected? Considering that append is deprecated, this has a huge impact.
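One way to sidestep the empty starting DataFrame is to collect the processed groups in a list and call pd.concat once at the end. This is a sketch under the assumption that each processing step returns a DataFrame; the sample data and the 'when' column are made up for illustration.

```python
import pandas as pd

# Made-up sample data with a datetime column.
df = pd.DataFrame({
    'key': ['a', 'b', 'a'],
    'when': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03']),
})

pieces = []
for _, temp in df.groupby('key'):
    # SOME PROCESSING OF DATA would go here
    pieces.append(temp)

# pd.concat raises ValueError on an empty list, so guard that case
# with an empty slice of df that keeps the original dtypes.
out = pd.concat(pieces) if pieces else df.iloc[0:0]
print(out.dtypes)
```

Since no empty frame takes part in the concatenation, every piece contributes the same dtypes. Concatenating once also avoids copying the accumulated data on every loop iteration.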

1 reaction

climatebrad commented, Dec 7, 2019

This can have severe memory consequences.
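To give a rough sense of the scale involved, the sketch below compares the memory footprint of the same values stored as category and as object. The data is invented for illustration, and exact byte counts will vary by pandas version and platform.

```python
import pandas as pd

# Invented data: three distinct labels repeated many times.
as_category = pd.Series(['red', 'green', 'blue'] * 10_000, dtype='category')

# What a silent fallback to object dtype would store instead.
as_object = as_category.astype(object)

category_bytes = as_category.memory_usage(deep=True)
object_bytes = as_object.memory_usage(deep=True)
print(category_bytes, object_bytes)
```

A categorical column stores small integer codes plus one copy of each category, while an object column stores a reference to a Python object for every row.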
