API: Consistent handling of duplicate input columns

When loading data with the pandas.read_X() functions, the handling of duplicate column names depends on the input format.

read_csv, read_fwf and read_excel accept a mangle_dupe_cols parameter. By default (True), .1, .2… is appended to duplicated column names. Setting it to False raises an exception saying the option is not implemented.
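
To illustrate the default described above, here is a minimal sketch that reads a small in-memory CSV whose header repeats a name. The sample data is made up for the example; only the default behavior of read_csv is shown.

```python
import io

import pandas as pd

# Made-up CSV whose header contains the name "a" twice.
csv_text = "a,b,a\n1,2,3\n"

# With default settings, read_csv renames the second "a" to "a.1".
df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))  # ['a', 'b', 'a.1']
```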

read_html also appends .1… to duplicated names, but does not provide the option.

read_json(orient='split') loads the data and keeps the duplicate column names as they are.

read_xml drops duplicated columns (I assume each column keeps overwriting the previous one with the same name).

Personally, I think we should be consistent across all of them. I would control this with an option (e.g. io.duplicate_columns). It could also be an argument to all the read_ methods, but I think these methods already have too many arguments. I also expect that users will rarely want to change this, and that it is very unlikely they will want different ways of handling duplicate column names in different calls to read methods.

Whether it’s an option or an argument, we could allow the following values (feel free to propose better names):

  • raise: If duplicate column names exist, raise an exception
  • drop: Keep one (maybe the first) and ignore the rest
  • allow: Load data with duplicate columns. Based on discussions in the data apis consortium and #13262, I’d add this for backward compatibility only, but we probably shouldn’t allow duplicate column names after a deprecation period. Or we can simply remove this option
  • {col}.{i}, {col}_{i}…: Allow appending an autonumeric with a custom format. By default, '{col}.{i}' could be used, as this seems to be the preferred way based on the current API. This would address #8908.

I think it’d be good to have a single function that receives the input column names and returns the final column names (indices of the columns to use may also be needed, for cases like drop), or raises when appropriate. All read_ functions should use it if the format can have duplicate column names.
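
A rough sketch of what such a function could look like, in plain Python. The name resolve_duplicate_columns, its signature, and the collision handling in the template case are assumptions made for illustration; none of this is an existing pandas API, and the actual pandas renaming logic may differ.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def resolve_duplicate_columns(
    names: Sequence[str], policy: str = "{col}.{i}"
) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: return (final_names, indices_of_columns_to_keep)."""
    all_indices = list(range(len(names)))

    if policy == "allow":
        # Keep every column and every name, duplicates included.
        return list(names), all_indices

    if policy == "raise":
        dupes = sorted(name for name, n in Counter(names).items() if n > 1)
        if dupes:
            raise ValueError(f"Duplicate column names: {dupes}")
        return list(names), all_indices

    if policy == "drop":
        # Keep only the first occurrence of each name.
        seen = set()
        final, kept = [], []
        for idx, name in enumerate(names):
            if name not in seen:
                seen.add(name)
                final.append(name)
                kept.append(idx)
        return final, kept

    # Anything else is treated as a template such as "{col}.{i}".
    if "{col}" not in policy or "{i}" not in policy:
        raise ValueError(f"Unknown policy: {policy!r}")

    taken = set(names)  # names that must not be reused for a renamed column
    seen = set()
    counters = {}
    final = []
    for name in names:
        if name not in seen:
            seen.add(name)
            final.append(name)
            continue
        # Assumption: skip any suffix that would collide with another name.
        i = counters.get(name, 0) + 1
        candidate = policy.format(col=name, i=i)
        while candidate in taken:
            i += 1
            candidate = policy.format(col=name, i=i)
        counters[name] = i
        taken.add(candidate)
        final.append(candidate)
    return final, all_indices


print(resolve_duplicate_columns(["a", "b", "a", "a"]))
# (['a', 'b', 'a.1', 'a.2'], [0, 1, 2, 3])
print(resolve_duplicate_columns(["a", "b", "a"], policy="drop"))
# (['a', 'b'], [0, 1])
```

Returning the indices alongside the names lets a reader apply drop without re-deriving which input columns survived; for the other policies the indices simply cover every column.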

Thoughts?

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
jreback commented, Jul 14, 2022

use raise (mirrors what we do in errors keyword)

1 reaction
jreback commented, Jul 14, 2022

-1 on a global option

+1 in a single name across functions

we already use allow_duplicates=bool in various places; extending this would be good
