MultiIndex row indexing with .loc fails with a tuple but works with a list of indices

Code Sample, a copy-pastable example if possible


import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}

df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) # the rows to be extracted

print(df)

Out[3]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
    1002 1        9
2   1001 1        3
    1002 2        4

Problem description

Now, extracting the desired rows with loc fails: it returns only a single row, (1, 1001, 2), instead of the three requested rows:

In [5]: df.loc[desired_rows, :]
Out[5]: 
              Value
ID1 ID2  ID3       
1   1001 2        2

Expected Output

One solution would be to convert the tuple to a list internally, because a list of indices works correctly:

In [6]: df.loc[list(desired_rows), :]
Out[6]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1002 2        4

Another solution is to raise an error if a tuple of indices is provided as the row indexer of loc, in order to prevent unexpected results.
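Until one of these is settled in pandas itself, the first suggestion can be applied on the caller's side. The following is a minimal sketch, not part of pandas: `select_rows` is a hypothetical helper name introduced here for illustration, and it assumes every element of `keys` is a complete (ID1, ID2, ID3) key.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])

desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))


def select_rows(frame, keys):
    """Hypothetical helper: treat `keys` as complete MultiIndex keys.

    Any iterable of key tuples (tuple, list, generator) is normalised
    to a list of tuples before it reaches .loc, so the outer container
    type cannot change how the keys are interpreted.
    """
    return frame.loc[[tuple(key) for key in keys], :]


result = select_rows(df, desired_rows)
print(result)
```

Normalising at the call site keeps the intent explicit: the outer container only groups the keys, and each inner tuple names one row.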

Output of `pd.show_versions()`

In [8]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

Created 6 years ago
Comments: 14 (13 by maintainers)

Top GitHub Comments

2 reactions

toobaz commented, Feb 1, 2018

I could add a statement that tuples are needed in the case of multiple indexers on a multiindex:

Great idea! I think such a statement should actually go at the beginning of http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index

You could maybe start by stating that MultiIndex keys take the form of tuples, then you could swap the first two examples currently provided (move the complete indexing one first), then introduce partial indexing, and mention that when doing partial indexing on the first level, you are allowed to pass only the first element of the tuple ('bar' stands for ('bar',)). Finally, I think a warning box could then clarify that (for the reasons above) tuples and lists are not equivalent in pandas, and in particular, tuples should not be used as lists of keys (for MultiIndexes, and not only).
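To make the complete versus partial indexing point concrete, here is a small sketch. The frame, its level names and the column name "x" are made up for illustration; the `df.loc[("bar",), :]` spelling is the verbose form that the pandas advanced indexing documentation gives for `df.loc["bar"]`.

```python
import pandas as pd

# Illustrative two-level index; all names here are assumptions for the example.
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=index)

# Complete indexing: the key is a full tuple, one label per level.
complete = df.loc[("bar", "two"), "x"]

# Partial indexing on the first level: the bare label ...
partial_short = df.loc["bar"]
# ... is shorthand for a one-element tuple used as the row indexer.
partial_tuple = df.loc[("bar",), :]

print(complete)
print(partial_short)
print(partial_tuple)
```

Both partial forms select the two rows under "bar", which is why a tuple is read as a single (possibly partial) key rather than as a list of keys.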

You might want to show examples of the fact that lists of tuples in general refer to multiple complete (MultiIndex) keys, while tuples of lists in general refer to multiple values on each, that is something like

In [2]: s = pd.Series(-1, index=pd.MultiIndex.from_product([[1, 2], [3, 4]]))

In [3]: s.loc[[(1, 3), (2, 4)]]
Out[3]: 
1  3   -1
2  4   -1
dtype: int64

In [4]: s.loc[([1, 2], [3, 4])]
Out[4]: 
1  3   -1
   4   -1
2  3   -1
   4   -1
dtype: int64

Aside from the possible docs improvements: yes, in some cases we interpret tuples as lists, but I think it should be seen as an undesired implementation legacy. Vice-versa, I see no harm (in general - caveats clearly can apply to specific cases) in interpreting generators, dicts or other list-likes as lists.
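Applied to the frame from the original report, the list-of-tuples versus tuple-of-lists distinction can be sketched as follows. This is an illustration under the assumption that a tuple row indexer is read level by level, as described in this comment; it does not reproduce the reported call itself, whose behavior may differ between pandas versions.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])

# List of tuples: each tuple is one complete (ID1, ID2, ID3) key.
by_keys = df.loc[[(1, 1001, 1), (1, 1001, 2), (2, 1002, 2)], :]

# Tuple of lists: one list of allowed values per level, so this selects
# every existing row with ID1 in [1, 2], ID2 in [1001] and ID3 in [1, 2].
per_level = df.loc[([1, 2], [1001], [1, 2]), :]

print(by_keys)
print(per_level)
```

Under the same per-level reading, the tuple of tuples from the report would match ID1 against (1, 1001, 1), ID2 against (1, 1001, 2) and ID3 against (2, 1002, 2). Only the row (1, 1001, 2) satisfies all three conditions, which appears consistent with the single row shown in the problem description.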

1 reaction

jorisvandenbossche commented, Feb 1, 2018

I usually don’t distinguish between lists and tuples in plain Python since they are both list-like objects. So this Pandas behavior tripped me up a bit - is this documented clearly somewhere?

Yes, this is one of the gotchas due to the complexity of MultiIndexing: we somehow need to distinguish between both. And the documentation can certainly be better about those things. But in general this is also an area where we would need more extensive testing of the different cases, and then better documentation of those cases (e.g. see my comment above; even for me it is difficult to really predict how something will be interpreted in certain cases).