MultiIndex row indexing with .loc fails with a tuple but works with a list of indices
Code Sample, a copy-pastable example if possible
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))  # the rows to be extracted
print(df)
Out[3]:
              Value
ID1 ID2  ID3
1   1001 1        1
         2        2
    1002 1        9
2   1001 1        3
    1002 2        4
Problem description
Extracting the desired rows with loc fails here: it returns only a single row instead of the three requested:
In [5]: df.loc[desired_rows, :]
Out[5]:
              Value
ID1 ID2  ID3
1   1001 2        2
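The single row shown above is consistent with one plausible reading of the tuple: each inner tuple is treated as the set of allowed values for one index level, rather than as a complete key. The sketch below only emulates that reading with explicit per-level masks; it does not call the pandas internals, and the per-level interpretation is an assumption inferred from the output, not a statement about the actual code path.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# Assumed reading: the i-th inner tuple lists the allowed values of the i-th level.
idx = df.index
mask = (
    idx.get_level_values("ID1").isin(list(desired_rows[0]))    # ID1 in {1, 1001}
    & idx.get_level_values("ID2").isin(list(desired_rows[1]))  # ID2 in {1, 1001, 2}
    & idx.get_level_values("ID3").isin(list(desired_rows[2]))  # ID3 in {2, 1002}
)
per_level = df[mask]
print(per_level)
```

Under this reading only the key (1, 1001, 2) satisfies all three conditions, which matches the row returned in Out[5].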
Expected Output
One solution would be to convert the tuple to a list internally, because a list of indices works correctly:
In [6]: df.loc[list(desired_rows), :]
Out[6]:
              Value
ID1 ID2  ID3
1   1001 1        1
         2        2
2   1002 2        4
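Until the behaviour is changed, the conversion can be done on the caller's side. The sketch below shows the list conversion from the report and, as an alternative that is not part of the original report, a boolean mask built with MultiIndex.isin; the variable names by_list and by_mask are illustrative only.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# A list of tuples is read as a list of complete MultiIndex keys.
by_list = df.loc[list(desired_rows), :]

# Alternative: keep the rows whose full index key is among the desired keys.
by_mask = df[df.index.isin(list(desired_rows))]

print(by_list)
print(by_mask)
```

The mask variant returns rows in the frame's own order and silently ignores keys that are absent, so it suits filtering more than strict lookup.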
Another solution is to raise an error if a tuple of indices is provided as the row indexer of loc, in order to prevent unexpected results.
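The error-raising idea can be sketched as a user-side wrapper. select_rows below is a hypothetical helper, not a pandas function, and the rule it applies (reject a tuple whose elements are all tuples) is an assumption about what such a check could look like.

```python
import pandas as pd

def select_rows(frame, keys):
    """Hypothetical guard: refuse a tuple of tuples as a row indexer."""
    if isinstance(keys, tuple) and keys and all(isinstance(k, tuple) for k in keys):
        raise TypeError(
            "Got a tuple of tuples; pass a list of tuples to select "
            "several complete MultiIndex keys."
        )
    return frame.loc[keys, :]

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# The list form passes the guard and selects the three complete keys.
selected = select_rows(df, list(desired_rows))

# The tuple form is rejected instead of silently returning fewer rows.
try:
    select_rows(df, desired_rows)
    rejected = False
except TypeError:
    rejected = True
```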
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 6 years ago
- Comments: 14 (13 by maintainers)
Top GitHub Comments
Great idea! I think such a statement should actually go at the beginning of http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index

You could maybe start by stating that MultiIndex keys take the form of tuples, then you could swap the first two examples currently provided (move the complete indexing one first), then introduce partial indexing, and mention that when doing partial indexing on the first level, you are allowed to only pass the first element of the tuple ('bar' stands for ('bar',)). Finally, I think a warning box could then clarify that (for the reasons above) tuples and lists are not equivalent in pandas, and in particular, tuples should not be used as lists of keys (for MultiIndexes, and not only).

You might want to show examples of the fact that lists of tuples in general refer to multiple complete (MultiIndex) keys, while tuples of lists in general refer to multiple values on each level, that is something like

Aside from the possible docs improvements: yes, in some cases we interpret tuples as lists, but I think it should be seen as an undesired implementation legacy. Vice versa, I see no harm (in general - caveats clearly can apply to specific cases) in interpreting generators, dicts or other list-likes as lists.
Yes, this is one of the gotchas due to the complexity of MultiIndexing: we somehow need to distinguish between both. And the documentation can certainly be better about those things. But in general this is also an area where we would need more extensive testing of the different cases, and then better documentation of those cases (e.g. see my comment above, even for me it is difficult to really predict how something will be interpreted in certain cases).