FR: Allow duplicate column names in pandas.read_csv

Right now, pandas’s read_csv() supports forcing column names read from CSV data to be unique:

>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> from StringIO import StringIO
>>> csv_data = """a,a,b
... 1,2,3
... 4,5,6"""
>>> df = pd.read_csv(StringIO(csv_data))
>>> df.columns.tolist()
['a', 'a.1', 'b']

The documentation suggests that passing mangle_dupe_cols=False to read_csv() will change this behavior to one where it’ll overwrite data on load. That doesn’t seem to be implemented as of this version:

>>> df = pd.read_csv(StringIO(csv_data), mangle_dupe_cols=False)
# snip
ValueError: Setting mangle_dupe_cols=False is not supported yet

However! Pandas doesn’t fundamentally disallow duplicate column names. There’s a simple (but somewhat convoluted) trick to get duplicate columns from a CSV file into your headers:

>>> df = pd.read_csv(StringIO(csv_data), header=None)
>>> df.columns = df.iloc[0]
>>> df = df.reindex(df.index.drop(0)).reset_index(drop=True)
>>> df
0  a  a  b
0  1  2  3
1  4  5  6

(Then again, maybe this isn’t so simple; there’s a funny “0” in there with the column headings and that seems weird.)
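The “0” appears because `df.iloc[0]` is a Series named after its row label, and that name is carried over to the columns index. One way around it, shown here as a sketch rather than a recommended fix, is to read the header row and the data rows in two separate passes and assign the header as a plain list. The sketch uses Python 3 (`io.StringIO` instead of the `StringIO` module used above) and assumes the source can be read twice.

```python
from io import StringIO  # Python 3 location of StringIO

import pandas as pd

csv_data = "a,a,b\n1,2,3\n4,5,6"

# Pass 1: read only the header row, as strings, so names are taken verbatim.
header = (
    pd.read_csv(StringIO(csv_data), header=None, nrows=1, dtype=str)
    .iloc[0]
    .tolist()
)

# Pass 2: read the data rows on their own, so dtypes are inferred from data only.
df = pd.read_csv(StringIO(csv_data), header=None, skiprows=1)

# Assigning a plain list leaves the columns index unnamed (no stray "0").
df.columns = header

print(df.columns.tolist())
print(df.dtypes.tolist())
```

With a real file, this means opening the file twice, which may or may not be acceptable depending on where the data comes from.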

Problem description

I would like a native way to read CSV files with repeated headers. In my application, it’s literally so I can warn people about duplicated column headers. Yes, I could use Python’s built-in csv module for this, but then I’m using two methods to read CSV files and it gets weird.

Since mangle_dupe_cols=False is not yet implemented, I would propose that passing it makes read_csv() keep duplicate column names as they appear in the file.

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Top GitHub Comments

njvack commented, Jan 25, 2018

Probably a slightly better “turn the first row into column headers” incantation:

df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)

Also, I’ve upgraded to pandas 0.22 and all this works in that version as well.
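For reference, here is that one-liner in context, as a minimal sketch written for Python 3 (`io.StringIO` instead of the `StringIO` module). It leaves out `copy=False`, which only affects copying and not the result; the variable name `raw` is just for illustration.

```python
from io import StringIO  # Python 3 location of StringIO

import pandas as pd

csv_data = "a,a,b\n1,2,3\n4,5,6"

# Read everything as data; the header row becomes row 0.
raw = pd.read_csv(StringIO(csv_data), header=None)

# Map the positional column labels (0, 1, 2) to the values in row 0,
# then drop that row and renumber the remaining rows from 0.
df = raw.rename(columns=raw.iloc[0]).iloc[1:].reset_index(drop=True)

print(df.columns.tolist())
print(df.iloc[0].tolist())
```

Because the header row is parsed as data here, the remaining cells come back as strings rather than numbers, so a conversion step may be needed afterwards.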

claresloggett commented, Feb 16, 2022

While this doesn’t directly address the issue, for the specific use-case that @njvack is encountering - wanting to warn the user about duplicate columns - I’ve resorted to a pre-load of the header row with

pd.read_csv(myfile, header=None, nrows=1).iloc[0,:].value_counts()

and then any values with count > 1 can be reported to the user.
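Wrapped up as a function, the pre-load check might look like the sketch below. The helper name `duplicated_headers` is hypothetical, and `dtype=str` is an addition here so that numeric-looking header names are compared as text.

```python
from io import StringIO

import pandas as pd


def duplicated_headers(source):
    """Return the header names that occur more than once (hypothetical helper)."""
    # Read only the first row, as strings, without treating it as a header.
    header = pd.read_csv(source, header=None, nrows=1, dtype=str).iloc[0]
    counts = header.value_counts()
    # Keep only names seen more than once, sorted for stable reporting.
    return sorted(counts[counts > 1].index)


dupes = duplicated_headers(StringIO("a,a,b\n1,2,3\n4,5,6"))
if dupes:
    print("Duplicate column headers:", ", ".join(dupes))
```

The full file can then be loaded with a normal `read_csv()` call, with the warning already issued.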

As a general issue though, mangle_dupe_cols still needs implementation by the looks of it (as of Pandas version 1.4.1).
