TYP: how to annotate DataFrame.__getitem__
I was about to create a PR but I realized that annotating DataFrame.__getitem__ might be impossible without making some simplifications.
There are two problems with __getitem__:
- We allow any Hashable key (one would expect that this always returns a Series), but slice (which is Hashable) returns a DataFrame.
- Columns can be a MultiIndex, in which case df["a"] can return a DataFrame.
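To make the two cases concrete, here is a small runtime sketch. The frame names (`flat`, `multi`) and the data are illustrative only and are not taken from the stubs or from the issue.

```python
import pandas as pd

# A frame with ordinary (flat) string columns.
flat = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

by_label = flat["a"]   # a single column label selects a Series
by_slice = flat[0:1]   # a slice selects rows, so the result is a DataFrame

# A frame whose columns are a MultiIndex (illustrative labels).
multi = pd.DataFrame(
    [[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "x")]),
)

# Selecting a top-level label returns the sub-frame of its inner columns.
top_level = multi["a"]

print(type(by_label).__name__, type(by_slice).__name__, type(top_level).__name__)
```

The same `"a"` key yields a Series in one frame and a DataFrame in the other, which is why the return type cannot be derived from the key type alone.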
The MS stubs seem to make two assumptions: 1) columns can only be of type str (and maybe a few more types, but not Hashable) and 2) MultiIndex doesn't exist. In practice, this will cover almost all cases.
I don't think there is a solution for the MultiIndex issue. Even if we make DataFrame generic to carry the type of the column index, there is no Not[MultiIndex] type, so we will always end up with incompatible and overlapping overloads.
The Hashable issue can partly be addressed:
# cover most common cases that return a Series
@overload
def __getitem__(self, key: Scalar) -> Series:
    ...

# cover most common cases that return a DataFrame
@overload
def __getitem__(self, key: list[HashableT] | np.ndarray | slice | Index | Series) -> DataFrame:
    ...

# everything else; could be Series | DataFrame, but that might create many errors
# (typeshed also uses Any in some cases to avoid unions)
@overload
def __getitem__(self, key: Hashable) -> Any:
    ...
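For readers who want to experiment with the overload ordering, the following is a minimal self-contained sketch. It is not the pandas or MS stub code: `Series` and `DataFrame` here are hypothetical stand-in classes, `str | int` is an assumed simplification of `Scalar`, and the runtime body exists only so the example can be executed.

```python
from __future__ import annotations

from collections.abc import Hashable
from typing import Any, overload


class Series:
    """Hypothetical stand-in for pandas.Series."""


class DataFrame:
    """Hypothetical stand-in for pandas.DataFrame."""

    # Common keys that select a single column.
    @overload
    def __getitem__(self, key: str | int) -> Series: ...

    # Common keys that select several columns or a range of rows.
    @overload
    def __getitem__(self, key: list[str] | slice) -> DataFrame: ...

    # Fallback for any other hashable key; Any avoids forcing callers
    # to narrow a Series | DataFrame union.
    @overload
    def __getitem__(self, key: Hashable) -> Any: ...

    def __getitem__(self, key: Any) -> Any:
        # Toy dispatch only; real pandas also inspects the column index.
        if isinstance(key, (list, slice)):
            return DataFrame()
        return Series()


df = DataFrame()
column = df["a"]           # matches the first overload: Series
subset = df[["a", "b"]]    # matches the second overload: DataFrame
rows = df[0:1]             # matches the second overload: DataFrame
other = df[("a", "b")]     # tuple key falls through to the Any overload
```

Because a type checker picks the first matching overload, the specific signatures have to come before the `Hashable` fallback; otherwise every hashable key would resolve to `Any`.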
Do you see a way to cover all cases of __getitem__, and if not, which assumptions are you willing to make? @simonjayhawkins @Dr-Irv
Top GitHub Comments
I just pushed a PR for this and other mypy related issues in the MS stubs, and this is what I am using that works for the tests that are there:
I've spent a fair bit of time improving the annotations in the MS stubs and building out the test framework there. In particular, I just created a huge PR to make things work right with mypy. Once that is accepted (possibly in the next few days), the next steps would be to first bring over the testing mechanism and integrate it into the CI, and then bring over the actual stubs themselves.
We'd have to figure out a way to manage the transition from MS publishing the MS stubs in pylance (and Visual Studio Code) releases (and no longer maintaining them) to us having a pandas release with pandas-supported stubs. I think the easiest way to do this would be to have a separate project with these stubs that is a submodule to pandas and a submodule to the MS stubs (which have more stubs than just pandas). But that would require both teams to agree that this is the (relatively short-term) way of managing these stubs. I'm open to other suggestions.