MultiIndex row indexing with .loc fails with a tuple but works with a list of indices

Code Sample, a copy-pastable example if possible


import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}

df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) # the rows to be extracted

print(df)

Out[3]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
    1002 1        9
2   1001 1        3
    1002 2        4

Problem description

Now, extracting the desired rows with loc fails: only a single row is returned instead of the three requested:

In [5]: df.loc[desired_rows, :]
Out[5]: 
              Value
ID1 ID2  ID3       
1   1001 2        2

Expected Output

One solution would be to convert the tuple to a list internally, because a list of indices works correctly:

In [6]: df.loc[list(desired_rows), :]
Out[6]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1002 2        4
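
The list-based workaround can be checked end to end with the data from this report. The sketch below is self-contained and relies only on the behavior shown in Out[6]. It deliberately does not exercise the tuple form, since that result may differ between pandas versions.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])

desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# A list of tuples is read as a list of complete MultiIndex keys,
# so converting the outer tuple to a list selects one row per key.
selected = df.loc[list(desired_rows), :]
print(selected)
```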

Another solution is to raise an error if a tuple of indices is provided as the row indexer of loc, in order to prevent unexpected results.
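
Until pandas itself rejects such input, the same guard can be approximated on the caller's side. The following is a minimal sketch under stated assumptions: `loc_rows` is a hypothetical helper name, not a pandas API, and it only handles the case of selecting complete keys on the row axis.

```python
import pandas as pd


def loc_rows(df, keys):
    """Select rows by complete MultiIndex keys.

    Hypothetical helper for illustration only; not part of pandas.
    A non-empty tuple of tuples is rejected because .loc may read it
    as one indexer per level rather than as a list of keys.
    """
    if isinstance(keys, tuple) and keys and all(isinstance(k, tuple) for k in keys):
        raise TypeError(
            "got a tuple of tuples; pass a list of tuples to select several rows"
        )
    return df.loc[list(keys), :]


df = pd.DataFrame(
    {"ID1": [1, 1, 2], "ID2": [1001, 1002, 1002], "Value": [1, 9, 4]}
).set_index(["ID1", "ID2"])

# A list of complete keys is passed through to .loc unchanged.
print(loc_rows(df, [(1, 1001), (2, 1002)]))

# The ambiguous tuple-of-tuples form is refused up front.
try:
    loc_rows(df, ((1, 1001), (2, 1002)))
except TypeError as exc:
    print("rejected:", exc)
```

Raising instead of silently converting keeps the caller aware that tuples and lists are not interchangeable here; a more lenient variant could call `list(keys)` without the check.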

Output of pd.show_versions()

In [8]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 14 (13 by maintainers)

Top GitHub Comments

2 reactions
toobaz commented, Feb 1, 2018

I could add a statement that tuples are needed in the case of multiple indexers on a multiindex:

Great idea! I think such a statement should actually go at the beginning of http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index

You could maybe start by stating that MultiIndex keys take the form of tuples, then you could swap the first two examples currently provided (move the complete indexing one first), then introduce partial indexing, and mention that when doing partial indexing on the first level, you are allowed to pass only the first element of the tuple ('bar' stands for ('bar',)). Finally, I think a warning box could then clarify that (for the reasons above) tuples and lists are not equivalent in pandas, and in particular, tuples should not be used as lists of keys (for MultiIndexes, and not only for them).

You might want to show examples of the fact that lists of tuples in general refer to multiple complete (MultiIndex) keys, while tuples of lists in general refer to multiple values on each level, that is, something like

In [2]: s = pd.Series(-1, index=pd.MultiIndex.from_product([[1, 2], [3, 4]]))

In [3]: s.loc[[(1, 3), (2, 4)]]
Out[3]: 
1  3   -1
2  4   -1
dtype: int64

In [4]: s.loc[([1, 2], [3, 4])]
Out[4]: 
1  3   -1
   4   -1
2  3   -1
   4   -1
dtype: int64
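
The relationship between the two forms can be made explicit by expanding the per-level lists into complete keys. This is a small sketch, assuming an index built with `from_product` so that every combination of labels exists; the values are made distinct (unlike the constant -1 above) so the selected rows are identifiable.

```python
import itertools

import pandas as pd

# Distinct values make it easy to see which rows are selected.
s = pd.Series(range(4), index=pd.MultiIndex.from_product([[1, 2], [3, 4]]))

# Tuple of lists: one list of acceptable labels per level.
per_level = s.loc[([1, 2], [3, 4])]

# List of tuples: each tuple is one complete key. Expanding the
# per-level lists with itertools.product yields the same four keys.
complete_keys = list(itertools.product([1, 2], [3, 4]))
explicit = s.loc[complete_keys]

print(per_level.index.tolist() == explicit.index.tolist())
```

With an index that does not contain every combination, the two forms are not guaranteed to behave alike, which is one more reason to keep them apart.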

Aside from the possible docs improvements: yes, in some cases we interpret tuples as lists, but I think it should be seen as an undesired implementation legacy. Vice versa, I see no harm (in general; caveats clearly can apply to specific cases) in interpreting generators, dicts or other list-likes as lists.

1 reaction
jorisvandenbossche commented, Feb 1, 2018

I usually don’t distinguish between lists and tuples in plain Python since they are both list-like objects. So this Pandas behavior tripped me up a bit - is this documented clearly somewhere?

Yes, this is one of the gotchas due to the complexity of MultiIndexing: we somehow need to distinguish between both. And the documentation can certainly be better about those things. But in general this is also an area where we would need more extensive testing of the different cases, and then better documentation of those cases (e.g. see my comment above; even for me it is difficult to really predict how something will be interpreted in certain cases).
