pd.concat doesn't preserve Categorical dtype when the categorical column is missing in one of the DataFrames.
a = pd.DataFrame({'f1': [1,2,3]})
b = pd.DataFrame({'f1': [2,3,1], 'f2': pd.Series([4,4,4]).astype('category')})
pd.concat((a,b), sort=True).dtypes
>> f1 int64
>> f2 object
>> dtype: object
Problem description
(Similar to #14016; not sure if it’s caused by the same bug or another one, feel free to merge.) When concatenating two DataFrames where one has a categorical column that the other is missing, the resulting column comes back as 'object', losing the "real" dtype.
If we were to fill the missing column with Nones (but with the same categorical dtype), the concatenation would keep the dtype. In the previous example, adding:
a['f2'] = pd.Series([None, None, None]).astype(b.dtypes['f2'])
before concatenating solves the problem.
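Putting the workaround together with the snippet above (a sketch; the printed dtypes are the expected result and may vary by pandas version):

import pandas as pd

a = pd.DataFrame({'f1': [1, 2, 3]})
b = pd.DataFrame({'f1': [2, 3, 1], 'f2': pd.Series([4, 4, 4]).astype('category')})

# Add the missing column to `a` as an all-NaN Series with b's categorical dtype
a['f2'] = pd.Series([None, None, None]).astype(b.dtypes['f2'])

print(pd.concat((a, b), sort=True).dtypes)
# f1       int64
# f2    category
# dtype: object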
I believe that if a column is missing from one of the concatenated DataFrames, a reasonable behavior would be to carry it over (filled with missing values) while preserving its dtype.
Expected Output
Column 'f2' should be categorical (same dtype as b['f2']).
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.6.5.final.0 python-bits: 64 OS: Darwin OS-release: 18.2.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.23.0 pytest: None pip: 10.0.1 setuptools: 39.0.1 Cython: None numpy: 1.14.3 scipy: 1.1.0 pyarrow: None xarray: None IPython: 6.4.0 sphinx: None patsy: None dateutil: 2.7.3 pytz: 2018.5 blosc: None bottleneck: None tables: 3.4.4 numexpr: 2.6.9 feather: None matplotlib: 2.0.2 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.9999999 sqlalchemy: 1.1.13 pymysql: None psycopg2: 2.7.3.2 (dt dec pq3 ext lo64) jinja2: 2.9.4 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
hello,
since append is deprecated, I’ve migrated all my
df.append(temp)
to
df = pd.concat([df, temp])
Usually, I have processing where I do something like:
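(A minimal sketch of that pattern; `out` and `temp` are from the description above, the column names are only illustrative, and the dtype behaviour noted in the comments depends on the pandas version in use.)

import pandas as pd

out = pd.DataFrame()  # start from an empty, column-less DataFrame
for i in range(3):
    # `temp` stands for whatever chunk each processing step produces
    temp = pd.DataFrame({
        'ts': pd.to_datetime(['2023-01-01', '2023-01-02']),
        'value': [1.0, 2.0],
    })
    out = pd.concat([out, temp])

print(out.dtypes)
# On the pandas versions described here, 'ts' comes back as object rather
# than datetime64[ns], because the first concat mixes the typed chunk with
# the empty, untyped `out`.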
Here, since out is an empty DataFrame at first, the concatenation does not keep the dtypes from the temp DataFrame. For instance, if I have a datetime column, it is converted to object.
Is that expected? Considering append is deprecated, this has a huge impact.
This can have severe memory consequences.
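A rough way to see the memory cost (a sketch; exact numbers depend on platform and pandas version): storing timestamps as boxed Python objects instead of datetime64[ns] takes several times more memory per value.

import pandas as pd

n = 100_000
ts = pd.Series(pd.date_range('2023-01-01', periods=n, freq='s'))

native = ts.memory_usage(deep=True)                     # datetime64[ns]: 8 bytes per value
as_object = ts.astype(object).memory_usage(deep=True)   # per-element Timestamp objects

print(native, as_object)  # the object-backed column is several times larger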