
No roundtrip DataFrame.to/from_csv() with multiindex columns

See original GitHub issue

Code Sample, a copy-pastable example if possible


import pandas as pd
import numpy as np

# create a dataframe with multiindex columns
arrays = [['A','A','B','B'],['a','b','a','b']]
tuples = list(zip(*arrays))
columnIndex = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
A = pd.DataFrame(data=np.random.randn(4, 4), columns=columnIndex)

# save it to csv
A.to_csv('test.csv')
print(A.columns)

# try to do a round trip...
B = pd.DataFrame.from_csv('test.csv')
print(B.columns)

Output:

 A.columns = MultiIndex(levels=[['A', 'B'], ['a', 'b']],
       labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
       names=['first', 'second'])

 B.columns = Index(['A', 'A.1', 'B', 'B.1'], dtype='object')

Problem description

I would expect .to_csv() and .from_csv() to round-trip; however, the column MultiIndex is not read back correctly by .from_csv(). It is possible to use read_csv(), but it requires extra parameters (see the sketch below). I think it would be sufficient to add a parameter to .from_csv() similar to index_col=sequence in read_csv().
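
For reference, a minimal sketch of the read_csv() round trip, assuming the test.csv file written above; header=[0, 1] reads both header rows back as a column MultiIndex and index_col=0 restores the row index:

import pandas as pd

# read both header rows back into a column MultiIndex;
# index_col=0 recovers the row index written by to_csv()
B = pd.read_csv('test.csv', header=[0, 1], index_col=0)
print(B.columns)  # MultiIndex with levels [['A', 'B'], ['a', 'b']]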

Expected Output

A.columns == B.columns

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

2 reactions
farleylai commented, Nov 10, 2017

To make things more interesting, what about an additional column with only a single-level header? What are the right parameters to to_csv() and from_csv()/read_csv() to recover the original DataFrame A without the unnecessary ‘Unnamed: XXX’ labels?

>>> arrays = [['A','A','B','B'],['a','b','a','b']]
>>> tuples = list(zip(*arrays))
>>> columnIndex = pd.MultiIndex.from_tuples(tuples)
>>> A = pd.DataFrame(data=np.random.randn(4,4),columns=columnIndex)
>>> A
          A                   B          
          a         b         a         b
0 -1.325581  0.734176 -0.503851  0.593437
1 -0.480105  0.179591  0.326949 -0.669441
2 -1.784733  0.516683 -0.785407 -0.794819
3 -0.235099  1.292330 -0.089105 -1.825709

>>> A['product'] = ['p1','p2','p3','p4']
>>> A
          A                   B           product
          a         b         a         b        
0 -1.325581  0.734176 -0.503851  0.593437      p1
1 -0.480105  0.179591  0.326949 -0.669441      p2
2 -1.784733  0.516683 -0.785407 -0.794819      p3
3 -0.235099  1.292330 -0.089105 -1.825709      p4

Now save and reread A:

>>> A.to_csv('A.csv')
>>> AA = pd.DataFrame.from_csv('A.csv', header=[0,1])
>>> AA
          A                   B                      product
          a         b         a         b Unnamed: 5_level_1
0 -1.325581  0.734176 -0.503851  0.593437                 p1
1 -0.480105  0.179591  0.326949 -0.669441                 p2
2 -1.784733  0.516683 -0.785407 -0.794819                 p3
3 -0.235099  1.292330 -0.089105 -1.825709                 p4

Any idea to get rid of the Unnamed things?
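
One possible workaround (a sketch, not an official pandas option, assuming the A.csv file written above) is to read the file back with read_csv() and blank out the auto-generated second-level labels:

>>> AA = pd.read_csv('A.csv', header=[0, 1], index_col=0)
>>> # replace the auto-generated 'Unnamed: ...' labels in the second level with ''
>>> AA.columns = pd.MultiIndex.from_tuples(
...     [(top, '' if str(sub).startswith('Unnamed:') else sub)
...      for top, sub in AA.columns])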

0 reactions
gfyoung commented, Jul 27, 2017

Ah, true. I think we both overlooked this. Generally, we encourage people to use read_csv, as per the docs.

By all means, feel free to update the documentation as a PR, though we should consider just deprecating the function (@jreback thoughts?) in the future.
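
For context, a rough sketch of the read_csv() call that mirrors DataFrame.from_csv()'s defaults (from_csv() historically defaulted to index_col=0 and parse_dates=True):

import pandas as pd

# approximate read_csv() equivalent of the legacy DataFrame.from_csv() defaults
df = pd.read_csv('test.csv', index_col=0, parse_dates=True)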

Read more comments on GitHub >

Top Results From Across the Web

CSV & Pandas: Unnamed columns and multi-index
I believe that I can use hierarchical indexing for the headings but all examples I've come across use nice, clean data frames unlike...

pandas.read_csv — pandas 1.5.2 documentation
Read a comma-separated values (csv) file into DataFrame. ... be a list of integers that specify row locations for a multi-index on the...

How to import CSV file with multi-level columns (Python Basics)
Here, we use pd.MultiIndex.from_tuples() to create new column-names rows. In a straightforward way, you can write column names one by one. If a ...

apache_beam.dataframe.io module - Apache Beam
The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that...

Manipulating DataFrames with Pandas - Trenton McKinney
Read in filename using pd.read_csv() and set the index to 'county' by specifying the index_col parameter. Create a separate DataFrame results with the...
