pd.concat doesn't preserve Categorical dtype when the categorical column is missing in one of the DataFrames.

import pandas as pd

a = pd.DataFrame({'f1': [1,2,3]})
b = pd.DataFrame({'f1': [2,3,1], 'f2': pd.Series([4,4,4]).astype('category')})

pd.concat((a,b), sort=True).dtypes
>> f1     int64
>> f2    object
>> dtype: object

Problem description

(Similar to #14016; I'm not sure whether it's caused by the same bug or a different one, so feel free to merge.) When concatenating two DataFrames where one has a categorical column that the other is missing, the resulting column has dtype ‘object’, losing the “real” categorical dtype.

If we fill the missing column with None values (using the same categorical dtype), the concatenation keeps the dtype. In the previous example, adding:

a['f2'] = pd.Series([None, None, None]).astype(b.dtypes['f2'])

before concatenating solves the problem.
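
The workaround above can be generalized to any number of frames. What follows is only a minimal sketch, not pandas' own behavior: the helper name `add_missing_categoricals` is hypothetical, and it assumes that the first categorical dtype seen for a column name is the one every frame should use.

```python
import pandas as pd


def add_missing_categoricals(frames):
    """Return copies of `frames` in which every categorical column found in
    any frame exists in all of them (filled with missing values).

    Hypothetical helper for illustration; not part of pandas.
    """
    # Remember the first categorical dtype seen for each column name.
    cat_dtypes = {}
    for frame in frames:
        for name, dtype in frame.dtypes.items():
            if isinstance(dtype, pd.api.types.CategoricalDtype):
                cat_dtypes.setdefault(name, dtype)

    aligned = []
    for frame in frames:
        frame = frame.copy()
        for name, dtype in cat_dtypes.items():
            if name not in frame.columns:
                # Same trick as in the issue: an all-missing column
                # that already carries the categorical dtype.
                empty = pd.Series([None] * len(frame), index=frame.index, dtype=object)
                frame[name] = empty.astype(dtype)
        aligned.append(frame)
    return aligned


a = pd.DataFrame({'f1': [1, 2, 3]})
b = pd.DataFrame({'f1': [2, 3, 1], 'f2': pd.Series([4, 4, 4]).astype('category')})

result = pd.concat(add_missing_categoricals([a, b]), sort=True)
```

Because both frames then carry `f2` with an identical categorical dtype, `result['f2']` stays categorical, and the rows that came from `a` hold missing values.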

I believe that if a column is missing from one of the concatenated DataFrames, a reasonable behavior would be to copy it and preserve its dtype.

Expected Output

Column 'f2' should be a categorical (same as b['f2']).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 7
  • Comments:7 (3 by maintainers)

Top GitHub Comments

3 reactions
yeyeric commented, Feb 23, 2022

Hello,

Since append is deprecated, I’ve migrated all my df.append(temp) calls to df = pd.concat([df, temp]).

Usually, my processing looks something like this:

out = pd.DataFrame()
for _, temp in df.groupby('key'):
    # SOME PROCESSING OF DATA
    out = pd.concat([out, temp]) # before: out = out.append(temp)

Here, since out is an empty DataFrame at first, the result does not keep the dtypes from temp. For instance, if I have a datetime column, it’s converted to object.

Is that expected? Considering that append is deprecated, this has a huge impact.
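
One way to sidestep the empty starting frame in a loop like the one above is to collect the pieces in a list and concatenate once at the end. This is a sketch under assumptions, not a statement about what pandas guarantees: the sample data, the `when` column, and the `pieces` list are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with a datetime column.
df = pd.DataFrame({
    'key': ['a', 'a', 'b'],
    'when': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03']),
})

pieces = []
for _, temp in df.groupby('key'):
    # SOME PROCESSING OF DATA
    pieces.append(temp)

# Concatenate only frames that share the same columns and dtypes.
# If no groups were produced, fall back to an empty slice of df,
# which keeps the column dtypes.
out = pd.concat(pieces) if pieces else df.iloc[0:0]
```

No empty, dtype-less DataFrame takes part in the concatenation here, so the dtypes of the pieces are not mixed with object columns. A single concat at the end also avoids re-copying the accumulated rows on every iteration.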

1 reaction
climatebrad commented, Dec 7, 2019

This can have severe memory consequences.
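
To get a feel for the memory difference, the footprint of the same values stored as category and as object can be compared with `Series.memory_usage(deep=True)`. The data below is invented for illustration, and the exact byte counts depend on the pandas and Python versions, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 30,000 rows but only three distinct labels.
labels = pd.Series(['red', 'green', 'blue'] * 10_000)

as_object = labels.astype(object)
as_category = labels.astype('category')

# deep=True also counts the Python string objects held by object columns.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f'object:   {object_bytes} bytes')
print(f'category: {category_bytes} bytes')
```

With few distinct values, the categorical column stores small integer codes plus one copy of each label, which is why a silent fallback to object can be expensive on large frames.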

