FR: Allow duplicate column names in pandas.read_csv
Right now, pandas’s read_csv() supports forcing column names read from CSV data to be unique:
>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> from StringIO import StringIO
>>> csv_data = """a,a,b
... 1,2,3
... 4,5,6"""
>>> df = pd.read_csv(StringIO(csv_data))
>>> df.columns.tolist()
['a', 'a.1', 'b']
The documentation suggests that passing mangle_dupe_cols=False to read_csv() will change this behavior to one where columns with duplicate names overwrite each other's data on load. That doesn’t seem to be implemented as of this version:
>>> df = pd.read_csv(StringIO(csv_data), mangle_dupe_cols=False)
# snip
ValueError: Setting mangle_dupe_cols=False is not supported yet
However! Pandas doesn’t fundamentally disallow duplicate column names. There’s a simple (but somewhat convoluted) trick to get duplicate columns from a CSV file into your headers:
>>> df = pd.read_csv(StringIO(csv_data), header=None)
>>> df.columns = df.iloc[0]
>>> df = df.reindex(df.index.drop(0)).reset_index(drop=True)
>>> df
0  a  a  b
0  1  2  3
1  4  5  6
(Then again, maybe this isn’t so simple; there’s a funny “0” in there with the column headings and that seems weird.)
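The stray “0” appears to be the name of the row Series (df.iloc[0]) carried over as the name of the columns index. A minimal sketch of a variant that avoids it, assuming Python 3 (io.StringIO rather than the Python 2 StringIO module used above): read the header row on its own, parse the data rows separately so they keep their numeric dtypes, then assign the header as a plain list. The variable names here are illustrative only.

```python
from io import StringIO

import pandas as pd

csv_data = "a,a,b\n1,2,3\n4,5,6"

# Read only the first row, as strings, to get the raw (unmangled) header.
header = (
    pd.read_csv(StringIO(csv_data), header=None, nrows=1, dtype=str)
    .iloc[0]
    .tolist()
)

# Parse the data rows without the header row, so numeric columns stay numeric.
df = pd.read_csv(StringIO(csv_data), header=None, skiprows=1)

# Assigning a plain list keeps the duplicate names and leaves the
# columns index unnamed.
df.columns = header
```

Because the header is never mixed into the data rows, the columns are parsed as integers here instead of as strings, which is not the case when the whole file is read with header=None.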
Problem description
I would like a native way to read CSV files with repeated headers. In my application, it’s literally so I can warn people about duplicated column headers. Yes, I could use Python’s built-in csv module for this, but then I’m using two methods to read CSV files and it gets weird.
Since mangle_dupe_cols=False is not yet implemented, I’d propose that it read repeated headers as-is, giving the behavior described above.
Output of pd.show_versions()
>>> pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- Created 6 years ago
- Reactions:2
- Comments:5 (1 by maintainers)
Probably a slightly better “turn the first row into column headers” incantation:
Also, I’ve upgraded to pandas 0.22 and all this works in that version as well.
While this doesn’t directly address the issue, for the specific use-case that @njvack is encountering - wanting to warn the user about duplicate columns - I’ve resorted to a pre-load of the header row with
and then any values with count > 1 can be reported to the user.
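The comment does not show how the header row is pre-loaded. One possible sketch, not necessarily what the commenter used, reads just the first row with pandas and counts the names with collections.Counter. The function name duplicated_headers is hypothetical, and the code assumes Python 3.

```python
from collections import Counter
from io import StringIO

import pandas as pd


def duplicated_headers(source):
    """Return {name: count} for header names that occur more than once.

    `source` is anything pd.read_csv accepts (a path or a file-like object).
    """
    # Read only the first row, as strings, without treating it as a header,
    # so pandas does not rename duplicates to 'a.1', 'a.2', ...
    header = (
        pd.read_csv(source, header=None, nrows=1, dtype=str, keep_default_na=False)
        .iloc[0]
        .tolist()
    )
    counts = Counter(header)
    return {name: n for name, n in counts.items() if n > 1}


dupes = duplicated_headers(StringIO("a,a,b\n1,2,3\n4,5,6"))
for name, n in dupes.items():
    print("Column %r appears %d times" % (name, n))
```

This keeps everything inside pandas, which sidesteps the concern about mixing in the csv module, at the cost of opening the file twice: once for the header check and once for the real load.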
As a general issue though, mangle_dupe_cols still needs implementation by the looks of it (as of pandas version 1.4.1).