
No roundtrip DataFrame.to/from_csv() with multiindex columns

See original GitHub issue

Code Sample, a copy-pastable example if possible


import pandas as pd
import numpy as np

# create a dataframe with multiindex columns
arrays = [['A','A','B','B'],['a','b','a','b']]
tuples = list(zip(*arrays))
columnIndex = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
A = pd.DataFrame(data=np.random.randn(4, 4), columns=columnIndex)

# save it to csv
A.to_csv('test.csv')
print(A.columns)

# try to do a round trip...
B = pd.DataFrame.from_csv('test.csv')
print(B.columns)

Output:

 A.columns = MultiIndex(levels=[['A', 'B'], ['a', 'b']],
       labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
       names=['first', 'second'])

 B.columns = Index(['A', 'A.1', 'B', 'B.1'], dtype='object')

Problem description

I would expect .to_csv() and .from_csv() to round-trip; however, the column MultiIndex is not read back correctly by .from_csv(). It is possible to use read_csv(), but it requires extra parameters (see the sketch below). I think it would be sufficient to add a parameter to .from_csv() similar to index_col=sequence in read_csv().
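
For reference, a minimal sketch of the read_csv() round trip, assuming the test.csv file written above; header=[0, 1] reads both header rows back as a column MultiIndex and index_col=0 restores the row index:

import pandas as pd

# read both header rows back into a column MultiIndex;
# index_col=0 recovers the row index written by to_csv()
B = pd.read_csv('test.csv', header=[0, 1], index_col=0)
print(B.columns)  # MultiIndex with levels [['A', 'B'], ['a', 'b']]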

Expected Output

A.columns == B.columns

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

2 reactions
farleylai commented, Nov 10, 2017

To make things more interesting, what about an additional column with only a single-level header? What are the right parameters to to_csv() and from_csv()/read_csv() to recover the original DataFrame A without the unnecessary ‘Unnamed: XXX’ labels?

>>> arrays = [['A','A','B','B'],['a','b','a','b']]
>>> tuples = list(zip(*arrays))
>>> columnIndex = pd.MultiIndex.from_tuples(tuples)
>>> A = pd.DataFrame(data=np.random.randn(4,4),columns=columnIndex)
>>> A
          A                   B          
          a         b         a         b
0 -1.325581  0.734176 -0.503851  0.593437
1 -0.480105  0.179591  0.326949 -0.669441
2 -1.784733  0.516683 -0.785407 -0.794819
3 -0.235099  1.292330 -0.089105 -1.825709

>>> A['product'] = ['p1','p2','p3','p4']
>>> A
          A                   B           product
          a         b         a         b        
0 -1.325581  0.734176 -0.503851  0.593437      p1
1 -0.480105  0.179591  0.326949 -0.669441      p2
2 -1.784733  0.516683 -0.785407 -0.794819      p3
3 -0.235099  1.292330 -0.089105 -1.825709      p4

Now save and reread A:

>>> A.to_csv('A.csv')
>>> AA = pd.DataFrame.from_csv('A.csv', header=[0,1])
>>> AA
          A                   B                      product
          a         b         a         b Unnamed: 5_level_1
0 -1.325581  0.734176 -0.503851  0.593437                 p1
1 -0.480105  0.179591  0.326949 -0.669441                 p2
2 -1.784733  0.516683 -0.785407 -0.794819                 p3
3 -0.235099  1.292330 -0.089105 -1.825709                 p4

Any idea to get rid of the Unnamed things?
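
One possible workaround (a sketch, not an official pandas option, assuming the A.csv file written above) is to read the file back with read_csv() and blank out the auto-generated second-level labels:

>>> AA = pd.read_csv('A.csv', header=[0, 1], index_col=0)
>>> # replace the auto-generated 'Unnamed: ...' labels in the second level with ''
>>> AA.columns = pd.MultiIndex.from_tuples(
...     [(top, '' if str(sub).startswith('Unnamed:') else sub)
...      for top, sub in AA.columns])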

0 reactions
gfyoung commented, Jul 27, 2017

Ah, true. I think we both overlooked this. Generally, we encourage people to use read_csv, as per the docs.

By all means, feel free to update the documentation as a PR, though we should consider just deprecating the function (@jreback thoughts?) in the future.
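
For context, a rough sketch of the read_csv() call that mirrors DataFrame.from_csv()'s defaults (from_csv() historically defaulted to index_col=0 and parse_dates=True):

import pandas as pd

# approximate read_csv() equivalent of the legacy DataFrame.from_csv() defaults
df = pd.read_csv('test.csv', index_col=0, parse_dates=True)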

Read more comments on GitHub >

Top Results From Across the Web

CSV & Pandas: Unnamed columns and multi-index
I believe that I can use hierarchical indexing for the headings but all examples I've come across use nice, clean data frames unlike...

pandas.read_csv — pandas 1.5.2 documentation
Read a comma-separated values (csv) file into DataFrame. ... be a list of integers that specify row locations for a multi-index on the...

How to import CSV file with multi-level columns (Python Basics)
Here, we use pd.MultiIndex.from_tuples() to create new column-names rows. In a straightforward way, you can write column names one by one. If a ...

apache_beam.dataframe.io module - Apache Beam
The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that...

Manipulating DataFrames with Pandas - Trenton McKinney
Read in filename using pd.read_csv() and set the index to 'county' by specifying the index_col parameter. Create a separate DataFrame results with the...
