Join result takes the index order of the other (right) DataFrame instead of that of the calling (left) one

Code Sample, a copy-pastable example if possible

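import pandas as pd

# Left frame has the default index [0, 1, 2]; right frame has index [2, 1].
df1 = pd.DataFrame({'a': [0, 10, 20]})
df2 = pd.DataFrame({'b': [200, 100]}, index=[2, 1])

# Inner joins in both directions.
print(df1.join(df2, how='inner'))
print(df2.join(df1, how='inner'))

# Same join with sort=True.
print(df1.join(df2, how='inner', sort=True))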

Problem description

Contrary to what is stated in the documentation of DataFrame.join(), with the default sort=False the returned DataFrame preserves the index order of the other (right) DataFrame instead of the index order of the calling (left) DataFrame.

In addition, the sort=True argument does not work.
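
If a sorted result is all that is needed, one option that does not depend on the sort argument is to sort the joined frame explicitly. This is only a sketch of a workaround using the frames from the code sample above, not a fix for the underlying bug; the variable name joined_sorted is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 10, 20]})
df2 = pd.DataFrame({'b': [200, 100]}, index=[2, 1])

# Join first, then sort by index explicitly instead of relying on sort=True.
joined_sorted = df1.join(df2, how='inner').sort_index()
print(joined_sorted)
```

Because the ordering is applied after the join, the result does not depend on which side's index order the join happened to keep.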

Expected Output

The returned DataFrame should preserve the index order of the calling (left) DataFrame.
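
To make the expected ordering concrete, here is a minimal sketch of a helper that forces the left index order on an inner join. The name inner_join_left_order is hypothetical (it is not part of pandas), and the sketch assumes both indexes contain unique labels.

```python
import pandas as pd

def inner_join_left_order(left, right):
    """Inner-join on the index and return rows in the left frame's index order.

    Hypothetical helper for illustration; assumes unique index labels.
    """
    joined = left.join(right, how='inner')
    # Labels of the left index that also exist in the right index,
    # kept in the order in which they appear in the left index.
    wanted = left.index[left.index.isin(right.index)]
    return joined.loc[wanted]

df1 = pd.DataFrame({'a': [0, 10, 20]})
df2 = pd.DataFrame({'b': [200, 100]}, index=[2, 1])

result = inner_join_left_order(df1, df2)
swapped = inner_join_left_order(df2, df1)
print(result)
print(swapped)
```

With df1 as the caller the rows come back in the order 1, 2; with df2 as the caller they come back in the order 2, 1, which is the behaviour the documentation describes.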

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-108-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: ca_ES.UTF-8
LOCALE: ca_ES.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0.post20161106
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

albertvillanova commented, Mar 6, 2017

The bug is not at the DataFrame level but at the Index level.

Swapping the arguments in the call to merge, as proposed by @sleepdeprivation, would introduce a new bug: lsuffix and rsuffix would be swapped too. Moreover, the order of the columns in the result would also be exchanged (first the columns from the right DataFrame, then the columns from the left one).

The real bug is in the Index.intersection() method. get_indexer must be called on the right index with the left index as its argument, so that it returns an indexer that maps the caller index (right) onto the passed index (left), and not the other way around. This is the desired output: an indexer on the right index that picks its elements in such a way that they are aligned with the left index.

I have implemented/updated some tests in my PR.
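
The difference between the two call directions of get_indexer can be seen on the indexes from the code sample. This is only an illustration of the idea described in the comment, not the actual pandas source of Index.intersection(); the variable names are made up for the example.

```python
import pandas as pd

left = pd.Index([0, 1, 2])   # index of the calling (left) frame
right = pd.Index([2, 1])     # index of the other (right) frame

# Called on left with right as argument: one position per label of `right`,
# so taking those positions from `left` follows the order of the right index.
pos_in_left = left.get_indexer(right)                          # [2, 1]
right_ordered = left.take(pos_in_left[pos_in_left != -1])

# Called on right with left as argument: one position per label of `left`
# (-1 marks labels missing from `right`), so taking the valid positions
# from `right` follows the order of the left index.
pos_in_right = right.get_indexer(left)                         # [-1, 1, 0]
left_ordered = right.take(pos_in_right[pos_in_right != -1])

print(list(right_ordered))   # [2, 1]
print(list(left_ordered))    # [1, 2]
```

Both results contain the same labels; only the second one is aligned with the left index, which is what an inner join from the left frame is documented to return.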

jreback commented, Mar 5, 2017

pandas/tests/tools/test_join.py (there is a single sorted test now)
