Appending Pandas dataframes in for loop results in ValueError
I recently posted this on StackOverflow. It seems to be a bug, so I am posting here as well.
I want to generate a dataframe by appending several separate dataframes that are created in a for loop. Each individual dataframe consists of a name column, a range of integers and a column identifying the category to which each integer belongs (e.g. quintile 1 to 5). If I generate each dataframe individually and then append one to the other to create a ‘master’ dataframe, there are no problems. However, when I use a loop to create each individual dataframe, trying to append a dataframe to the master dataframe results in:
ValueError: incompatible categories in categorical concat
A work-around (suggested by jezrael) involved appending each dataframe to a list of dataframes and concatenating them using pd.concat.
I’ve written a simplified loop to illustrate:
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
# Define column names
colNames = ('a','b','c')
# Define a dataframe with the required column names
masterDF = pd.DataFrame(columns = colNames)
# A list of the group names
names = ['Group1','Group2','Group3']
# Create a dataframe for each group
for i in names:
    tempDF = pd.DataFrame(columns = colNames)
    tempDF['a'] = np.arange(1,11,1)
    tempDF['b'] = i
    tempDF['c'] = pd.cut(np.arange(1,11,1),
                         bins = np.linspace(0,10,6),
                         labels = [1,2,3,4,5])
    print(tempDF)
    print('\n')
    # Try to append temporary DF to master DF
    masterDF = masterDF.append(tempDF,ignore_index=True)
print(masterDF)
Expected Output
a b c
0 1 Group1 1
1 2 Group1 1
2 3 Group1 2
3 4 Group1 2
4 5 Group1 3
5 6 Group1 3
6 7 Group1 4
7 8 Group1 4
8 9 Group1 5
9 10 Group1 5
10 11 Group2 1
11 12 Group2 1
12 13 Group2 2
13 14 Group2 2
...
28 29 Group3 5
29 30 Group3 5
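For reference, the work-around mentioned above (collecting the per-group dataframes in a list and concatenating once with pd.concat) might look roughly like the sketch below. It reuses the names from the code sample; the list name `frames` is my own, and building each dataframe from a dict is a simplification, not part of the original report.

```python
import numpy as np
import pandas as pd

# Same column names and group names as in the code sample
colNames = ['a', 'b', 'c']
names = ['Group1', 'Group2', 'Group3']

# Collect each group's dataframe in a list (hypothetical name: frames)
frames = []
for i in names:
    values = np.arange(1, 11, 1)
    tempDF = pd.DataFrame({
        'a': values,
        'b': i,
        'c': pd.cut(values,
                    bins=np.linspace(0, 10, 6),
                    labels=[1, 2, 3, 4, 5]),
    }, columns=colNames)
    frames.append(tempDF)

# Concatenate once at the end instead of appending inside the loop
masterDF = pd.concat(frames, ignore_index=True)
print(masterDF)
```

Because every dataframe in the list already has the same categories in column c, the concatenation never has to combine a categorical column with the empty master dataframe.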
output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.18.1
nose: None
pip: 1.5.6
setuptools: 20.1.1
Cython: None
numpy: 1.11.0
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.1.1
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.0
openpyxl: 2.3.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: 0.7.4.None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
Issue Analytics
- Created 7 years ago
- Comments:13 (11 by maintainers)
Well, if we say that an empty series is ordered=False, then it should actually raise an error instead of changing the order of the result 😃 But in this case you don’t have an empty categorical, just an empty frame without dtype info, so it should ignore whether that part is ordered or not.
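To make the distinction concrete, here is a small sketch that inspects the two pieces involved in the report: the empty master dataframe (which has column names but no categorical dtype) and the result of pd.cut (which is an ordered categorical). The variable name `binned` is only for illustration.

```python
import numpy as np
import pandas as pd

# The empty master frame from the report: column names only, no data
masterDF = pd.DataFrame(columns=['a', 'b', 'c'])
print(masterDF.dtypes)

# The values produced by pd.cut in the loop (hypothetical name: binned)
binned = pd.cut(np.arange(1, 11, 1),
                bins=np.linspace(0, 10, 6),
                labels=[1, 2, 3, 4, 5])
print(binned.ordered)
print(binned.categories)
```

So the first append combines an ordered categorical column with an empty column that carries no category information at all, which is the case described above.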
I met the same problem in #13626 and wrote a short summary of the Series / Index differences. How about the following spec:
union_categorical ordered should be preserved.
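As a rough illustration of what preserving ordered could mean, the sketch below uses pandas.api.types.union_categoricals, the public function available in later pandas releases. I am assuming this is the helper the comment refers to as union_categorical; the inputs `first` and `second` are made up for the example and share the same ordered categories.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two ordered categoricals with identical categories (made-up example data)
first = pd.Categorical([1, 1, 2], categories=[1, 2, 3, 4, 5], ordered=True)
second = pd.Categorical([4, 5, 5], categories=[1, 2, 3, 4, 5], ordered=True)

# Combine them and check whether the ordered flag survives
combined = union_categoricals([first, second])
print(combined)
print(combined.ordered)
```

This only covers the simple case where both inputs agree on categories and ordering; how mismatched or empty inputs should behave is the open question in this thread.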