TYP: how to annotate DataFrame.__getitem__

I was about to create a PR but I realized that annotating DataFrame.__getitem__ might be impossible without making some simplifications.

There are two problems with __getitem__:

We allow any Hashable key (one would expect this to always return a Series), but a slice (which is also Hashable) returns a DataFrame.
Columns can be a MultiIndex, in which case df["a"] can return a DataFrame.
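
A quick runtime check illustrates both problems. This is only a minimal sketch that assumes pandas is installed; the frame contents and the names flat and wide are made up for illustration and do not come from the issue.

```python
import pandas as pd

# Flat columns: a scalar key selects one column, a slice selects rows.
flat = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
col = flat["a"]    # a Series
rows = flat[0:1]   # a DataFrame, although the key is a single object

# MultiIndex columns: the same scalar key now selects a sub-frame.
wide = pd.DataFrame(
    [[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "x")]),
)
sub = wide["a"]    # a DataFrame holding the "x" and "y" columns under "a"

print(type(col).__name__, type(rows).__name__, type(sub).__name__)
```

The key has the static type str in both flat["a"] and wide["a"], so the key type alone cannot tell a type checker which return type applies.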

The MS stubs seem to make two assumptions: 1) column labels can only be of type str (and maybe a few more types, but not any Hashable) and 2) MultiIndex doesn’t exist. In practice, this covers almost all cases.

I don’t think there is a solution for the MultiIndex issue. Even if we make DataFrame generic to carry the type of the column index, there is no Not[MultiIndex] type, so we will always end up with incompatible and overlapping overloads.

The Hashable issue can partly be addressed:

# cover most common cases that return a Series
@overload
def __getitem__(self, key: Scalar) -> Series:
    ...

# cover most common cases that return a DataFrame
@overload
def __getitem__(self, key: list[HashableT] | np.ndarray | slice | Index | Series) -> DataFrame:
    ...

# everything else
@overload
def __getitem__(self, key: Hashable) -> Any:  # or Series | DataFrame (but that might create many errors; typeshed also uses Any in some cases to avoid unions)
    ...
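
To make the three-overload idea concrete, here is a self-contained sketch using toy stand-ins. The Series and DataFrame classes below are hypothetical and are not the pandas classes, the Scalar alias is a simplified assumption (the real alias covers more types), and the key union is trimmed to list and slice. Whether a particular type checker accepts these overloads without overlap warnings has not been verified here.

```python
from __future__ import annotations

from typing import Any, Hashable, Union, overload

# Simplified stand-in for the Scalar alias (assumption for this sketch).
Scalar = Union[str, bytes, int, float, complex, bool]


class Series:
    """Toy stand-in for a single column."""

    def __init__(self, name: Hashable, values: list[Any]) -> None:
        self.name = name
        self.values = values


class DataFrame:
    """Toy stand-in with flat (non-MultiIndex) column labels."""

    def __init__(self, data: dict[Hashable, list[Any]]) -> None:
        self._data = dict(data)

    # Common keys that return a Series.
    @overload
    def __getitem__(self, key: Scalar) -> Series: ...
    # Common keys that return a DataFrame.
    @overload
    def __getitem__(self, key: list[Hashable] | slice) -> DataFrame: ...
    # Everything else: give up and return Any.
    @overload
    def __getitem__(self, key: Hashable) -> Any: ...

    def __getitem__(self, key: Any) -> Any:
        if isinstance(key, slice):
            # A slice selects rows from every column.
            return DataFrame({c: v[key] for c, v in self._data.items()})
        if isinstance(key, list):
            # A list of labels selects a subset of columns.
            return DataFrame({c: self._data[c] for c in key})
        # Any other key is treated as a single column label.
        return Series(key, self._data[key])


frame = DataFrame({"a": [1, 2], "b": [3, 4]})
column = frame["a"]       # matches the Scalar overload
subset = frame[["a"]]     # matches the list | slice overload
first_row = frame[0:1]    # matches the list | slice overload
```

Overloads are tried in order, so the final Hashable overload only catches keys that the first two do not match, such as a tuple label.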

Do you see a way to cover all cases of __getitem__ and if not which assumptions are you willing to make? @simonjayhawkins @Dr-Irv

Issue Analytics

Created a year ago
Comments: 7 (7 by maintainers)

Top GitHub Comments

Dr-Irv commented, Apr 2, 2022

I just pushed a PR for this and other mypy related issues in the MS stubs, and this is what I am using that works for the tests that are there:

    @overload
    def __getitem__(self, idx: Scalar) -> Series: ...
    @overload
    def __getitem__(self, rows: slice) -> DataFrame: ...
    @overload
    def __getitem__(
        self,
        idx: Union[
            Tuple,
            Series[_bool],
            DataFrame,
            List[_str],
            List[Hashable],
            Index,
            np_ndarray_str,
            Sequence[Tuple[Scalar, ...]],
        ],
    ) -> DataFrame: ...

Dr-Irv commented, Apr 6, 2022

It was my understanding from the dev meetings a couple of months back that we were moving the MS Stubs across as is in the first instance?

I’m looking forward to that (especially if it is done sooner rather than later)! Maybe the question of how to annotate __getitem__ should be discussed at the MS stubs instead of here, or when trying to consolidate the copied and inline annotations.

I’ve spent a fair bit of time improving the annotations in the MS stubs and building out the test framework there. In particular, I just created a huge PR to make things work right with mypy. Once that is accepted (possibly in the next few days), the next steps would be to first bring over the testing mechanism and integrate it into the CI, and then bring over the actual stubs themselves.

We’d have to figure out a way to manage the transition from MS publishing the MS stubs in Pylance (and Visual Studio Code) releases (and no longer maintaining them) to us having a pandas release with pandas-supported stubs. I think the easiest way to do this would be to have a separate project with these stubs that is a submodule of pandas and a submodule of the MS stubs (which have more stubs than just pandas). But that would require both teams to agree that this is the (relatively short-term) way of managing these stubs. I’m open to other suggestions.