How should I store frames with multiindex columns in CSV?

Hello.

I tried to save a dataframe with MultiIndex used as columns to a CSV file and load it back, but I had no luck.

import pandas as pd
from io import StringIO
# Create a frame with MultiIndex columns
frame = pd.DataFrame({('AAPL', 'OPEN'): [1, 2, 3, 4], ('AAPL', 'CLOSE'): [1, 2, 3, 4], ('MSFT', 'OPEN'): [1, 2, 3, 4], ('MSFT', 'CLOSE'): [1, 2, 3, 4]})
# Make sure it was created as wanted.
frame
#   AAPL       MSFT     
#   CLOSE OPEN CLOSE OPEN
# 0     1    1     1    1
# 1     2    2     2    2
# 2     3    3     3    3
# 3     4    4     4    4

# Try to convert the frame to CSV
s1 = frame.to_csv()
s2 = frame.to_csv(tupleize_cols=True)
# FutureWarning displayed - tupleize_cols is deprecated.

print(s1)
# ,AAPL,AAPL,MSFT,MSFT
# ,CLOSE,OPEN,CLOSE,OPEN
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4

print(s2)
# ,"('AAPL', 'CLOSE')","('AAPL', 'OPEN')","('MSFT', 'CLOSE')","('MSFT', 'OPEN')"
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4

# Read the CSV strings back to DataFrames
f1 = pd.read_csv(StringIO(s1))
f2 = pd.read_csv(StringIO(s2), tupleize_cols=True)
# Warning about tupleize_cols here

# Neither frame looks like the original one.
f1
#    Unnamed: 0   AAPL AAPL.1   MSFT MSFT.1
# 0         NaN  CLOSE   OPEN  CLOSE   OPEN
# 1         0.0      1      1      1      1
# 2         1.0      2      2      2      2
# 3         2.0      3      3      3      3
# 4         3.0      4      4      4      4

f2
#    Unnamed: 0  ('AAPL', 'CLOSE')  ('AAPL', 'OPEN')  ('MSFT', 'CLOSE')  ('MSFT', 'OPEN')
# 0           0                  1                 1                  1                 1
# 1           1                  2                 2                  2                 2
# 2           2                  3                 3                  3                 3
# 3           3                  4                 4                  4                 4

As you can see, neither frame has MultiIndex columns like the original. So how should I save a DataFrame with MultiIndex columns to a CSV file and load it back to get a frame identical to the original?

I also tried saving the frame as JSON, but ran into problems there too. Here is what the frame shown above is converted to:

frame.to_json()
'{"["AAPL","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["AAPL","OPEN"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","OPEN"]":{"0":1,"1":2,"2":3,"3":4}}'

So, tupleized multiindexed column names are obviously incorrectly quoted.

With best regards,

Alex.

INSTALLED VERSIONS

commit: None
python: 3.4.2.final.0
python-bits: 32
OS: Linux
OS-release: 3.16.0-6-686-pae
machine: i686
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+318.g272bbdc
pytest: 3.6.3
pip: 1.5.6
setuptools: 5.5.1
Cython: 0.28.4
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Top GitHub Comments

8 reactions

chris-b1 commented, Jul 19, 2018

read_csv can handle hierarchical columns, but they must be specified on the way in by passing a list of row numbers to the header argument.

from io import StringIO
buf = StringIO()
frame.to_csv(buf)
buf.seek(0)

In [109]: pd.read_csv(buf, header=[0,1], index_col=0)
Out[109]: 
  AAPL       MSFT     
 CLOSE OPEN CLOSE OPEN
0     1    1     1    1
1     2    2     2    2
2     3    3     3    3
3     4    4     4    4
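
To check that this actually restores the frame from the question, here is a minimal round-trip sketch. It uses a CSV string in memory rather than a file, and the variable names (csv_text, restored) are illustrative only. It compares only the column labels and the cell values; it does not verify dtypes or index/column level names, which may behave differently across pandas versions.

```python
from io import StringIO

import pandas as pd

# Same frame as in the question: two-level (ticker, field) column labels.
frame = pd.DataFrame({
    ('AAPL', 'OPEN'): [1, 2, 3, 4],
    ('AAPL', 'CLOSE'): [1, 2, 3, 4],
    ('MSFT', 'OPEN'): [1, 2, 3, 4],
    ('MSFT', 'CLOSE'): [1, 2, 3, 4],
})

# to_csv writes one header row per column level.
csv_text = frame.to_csv()

# header=[0, 1] says the first two rows hold the column levels;
# index_col=0 says the first column holds the row index.
restored = pd.read_csv(StringIO(csv_text), header=[0, 1], index_col=0)

# Compare column labels and cell values with the original frame.
same_columns = list(restored.columns) == list(frame.columns)
same_values = restored.values.tolist() == frame.values.tolist()
print(isinstance(restored.columns, pd.MultiIndex), same_columns, same_values)
```

Without index_col=0, the saved row index would come back as an extra data column, as in the f1 and f2 outputs in the question.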

0 reactions

WillAyd commented, Jul 20, 2018

Can you open a separate bug for the JSON orient="split" issue? That does seem off.

As a side note on orient="table":

Timestamp support is being added in #21827
Integers are not valid column labels but should be fine as an index; again, if you have an example you can provide, please open it as a bug in a separate issue