FR: Allow duplicate column names in pandas.read_csv

Right now, pandas’s read_csv() supports forcing column names read from CSV data to be unique:

>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> from StringIO import StringIO
>>> csv_data = """a,a,b
... 1,2,3
... 4,5,6"""
>>> df = pd.read_csv(StringIO(csv_data))
>>> df.columns.tolist()
['a', 'a.1', 'b']

The documentation suggests that passing mangle_dupe_cols=False to read_csv() will change this behavior to one where it’ll overwrite data on load. That doesn’t seem to be implemented as of this version:

>>> df = pd.read_csv(StringIO(csv_data), mangle_dupe_cols=False)
# snip
ValueError: Setting mangle_dupe_cols=False is not supported yet

However! Pandas doesn’t fundamentally disallow duplicate column names. There’s a simple (but somewhat convoluted) trick to get duplicate columns from a CSV file into your headers:

>>> df = pd.read_csv(StringIO(csv_data), header=None)
>>> df.columns = df.iloc[0]
>>> df = df.reindex(df.index.drop(0)).reset_index(drop=True)
>>> df
0  a  a  b
0  1  2  3
1  4  5  6

(Then again, maybe this isn’t so simple; there’s a funny “0” in there with the column headings and that seems weird.)
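The "0" is the name of the columns index: `df.iloc[0]` is a Series named `0`, and assigning it to `df.columns` carries that name along. Below is a minimal Python 3 sketch of the same workaround, assuming a recent pandas and `io.StringIO` in place of the Python 2 `StringIO` module. It assigns a plain list so no name is inherited. The variable name `raw` and the `dtype=str` choice are illustrative assumptions, not part of the original report.

```python
from io import StringIO

import pandas as pd

csv_data = "a,a,b\n1,2,3\n4,5,6"

# header=None makes pandas treat the header line as ordinary data (row 0).
# dtype=str keeps every cell as text, since that row mixes names and values.
raw = pd.read_csv(StringIO(csv_data), header=None, dtype=str)

# Drop the header row from the data, then reuse it as the column labels.
df = raw.iloc[1:].reset_index(drop=True)
df.columns = raw.iloc[0].tolist()  # a plain list carries no index name

print(df.columns.tolist())  # ['a', 'a', 'b']
```

Note that the data columns come back as strings here, so numeric columns would need converting afterwards (for example with `pd.to_numeric`).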

Problem description

I would like a native way to read CSV files with repeated headers. In my application, it’s literally so I can warn people about duplicated column headers. Yes, I could use python’s built-in csv module for this, but then I’m using two methods to read CSV files and it gets weird.
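For comparison, the `csv`-module route mentioned above could look roughly like the sketch below. The function name `duplicated_headers` is hypothetical, and the sketch assumes the header is the first row of the file.

```python
import csv
from collections import Counter
from io import StringIO


def duplicated_headers(fileobj):
    """Return header names that occur more than once, in first-seen order."""
    # Read only the first row; an empty file yields an empty header.
    header = next(csv.reader(fileobj), [])
    counts = Counter(header)
    return [name for name in counts if counts[name] > 1]


print(duplicated_headers(StringIO("a,a,b\n1,2,3\n4,5,6")))  # ['a']
```

This works, but it is exactly the second CSV-reading code path the request is trying to avoid.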

Since mangle_dupe_cols=False is not yet implemented, I'd propose that it produce this behavior: load the CSV with its duplicate column names left intact.

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
njvack commented, Jan 25, 2018

Probably a slightly better “turn the first row into column headers” incantation:

df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)

Also, I’ve upgraded to pandas 0.22 and all this works in that version as well.

0 reactions
clareslogget commented, Feb 16, 2022

While this doesn’t directly address the issue, for the specific use-case that @njvack is encountering - wanting to warn the user about duplicate columns - I’ve resorted to a pre-load of the header row with

pd.read_csv(myfile, header=None, nrows=1).iloc[0,:].value_counts()

and then any values with count > 1 can be reported to the user.
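Wrapped up as a helper, that pre-load check might look like the sketch below. The function name `duplicate_header_counts` is hypothetical, and `dtype=str` is an added assumption so that numeric-looking header cells are compared as text.

```python
from io import StringIO

import pandas as pd


def duplicate_header_counts(source):
    """Map each repeated header name to how many times it appears."""
    # Read just the header line as a data row so pandas does not rename anything.
    header = pd.read_csv(source, header=None, nrows=1, dtype=str).iloc[0]
    counts = header.value_counts()
    return counts[counts > 1].to_dict()


dupes = duplicate_header_counts(StringIO("a,a,b\n1,2,3\n4,5,6"))
if dupes:
    print("Duplicated column headers:", dupes)
```

If `source` is an open file object rather than a path, it has to be rewound (or reopened) before the real `read_csv()` call, since the pre-load consumes part of it.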

As a general issue though, mangle_dupe_cols still needs implementation by the looks of it (as of Pandas version 1.4.1).
