TYP: how to annotate DataFrame.__getitem__

I was about to create a PR but I realized that annotating DataFrame.__getitem__ might be impossible without making some simplifications.

There are two problems with __getitem__:

  • We allow any Hashable key, and one would expect such a key to always return a Series, but a slice (which is Hashable) returns a DataFrame.
  • Columns can be a MultiIndex, in which case df["a"] can return a DataFrame.
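
Both problems can be reproduced in a few lines. This is only a small illustration, assuming pandas is installed; the frames and column names are made up for the example and are not taken from the issue.

```python
import pandas as pd

# Flat string columns: a scalar key selects one column as a Series.
flat = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
col = flat["a"]

# A slice key selects rows, so the result is a DataFrame.
rows = flat[0:2]

# MultiIndex columns: the same scalar key selects every column under the
# top-level label "a", so the result is a DataFrame rather than a Series.
multi = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "x")]),
)
sub = multi["a"]

print(type(col).__name__, type(rows).__name__, type(sub).__name__)
```

The key expression is identical in the first and third lookups, which is why the return type cannot be decided from the key type alone.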

The MS stubs seem to make two assumptions: 1) columns can only be of type str (and maybe a few more types, but not Hashable) and 2) MultiIndex doesn’t exist. In practice, this will cover almost all cases.

I don’t think there is a solution for the MultiIndex issue. Even if we make DataFrame generic to carry the type of the column index, there is no Not[MultiIndex] type, so we will always end up with incompatible and overlapping overloads.

The Hashable issue can partly be addressed:

# cover most common cases that return a Series
@overload
def __getitem__(self, key: Scalar) -> Series:
    ...

# cover most common cases that return a DataFrame
@overload
def __getitem__(self, key: list[HashableT] | np.ndarray | slice | Index | Series) -> DataFrame:
    ...

# everything else
@overload
def __getitem__(self, key: Hashable) -> Any:  # or Series | DataFrame (but might create many errors; typeshed also uses Any in some cases to avoid unions)
    ...
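
To make the overload ordering concrete, here is a self-contained sketch that uses only the standard library. ToySeries, ToyFrame and the simplified Scalar alias are hypothetical stand-ins invented for this example, not pandas classes or the stubs' real definitions. At runtime the @overload signatures are ignored and only the final implementation runs; the overloads only describe what a type checker should infer.

```python
from __future__ import annotations

from collections.abc import Hashable
from typing import Any, Union, overload

# Simplified stand-in for a Scalar alias (assumption for this sketch).
Scalar = Union[str, int, float, bytes, bool]


class ToySeries:
    """Hypothetical stand-in for a single column."""

    def __init__(self, values: list[Any]) -> None:
        self.values = values


class ToyFrame:
    """Hypothetical stand-in for a table with flat column labels."""

    def __init__(self, data: dict[Hashable, list[Any]]) -> None:
        self.data = data

    # Most common keys that return a single column.
    @overload
    def __getitem__(self, key: Scalar) -> ToySeries: ...
    # Most common keys that return a table.
    @overload
    def __getitem__(self, key: list[Any] | slice) -> ToyFrame: ...
    # Everything else: fall back to Any instead of a union.
    @overload
    def __getitem__(self, key: Hashable) -> Any: ...

    def __getitem__(self, key: Any) -> Any:
        if isinstance(key, slice):
            # A slice selects rows from every column.
            return ToyFrame({name: col[key] for name, col in self.data.items()})
        if isinstance(key, list):
            # A list of labels selects several columns.
            return ToyFrame({name: self.data[name] for name in key})
        # Any other key is treated as a single column label.
        return ToySeries(self.data[key])


frame = ToyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(type(frame["a"]).__name__)    # ToySeries
print(type(frame[0:2]).__name__)    # ToyFrame
print(type(frame[["b"]]).__name__)  # ToyFrame
```

The toy class deliberately has no MultiIndex equivalent, so it sidesteps the second problem in the same way the assumptions above do.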

Do you see a way to cover all cases of __getitem__ and, if not, which assumptions are you willing to make? @simonjayhawkins @Dr-Irv

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

Dr-Irv commented, Apr 2, 2022

I just pushed a PR for this and other mypy related issues in the MS stubs, and this is what I am using that works for the tests that are there:

    @overload
    def __getitem__(self, idx: Scalar) -> Series: ...
    @overload
    def __getitem__(self, rows: slice) -> DataFrame: ...
    @overload
    def __getitem__(
        self,
        idx: Union[
            Tuple,
            Series[_bool],
            DataFrame,
            List[_str],
            List[Hashable],
            Index,
            np_ndarray_str,
            Sequence[Tuple[Scalar, ...]],
        ],
    ) -> DataFrame: ...

Dr-Irv commented, Apr 6, 2022

It was my understanding from the dev meetings a couple of months back that we were moving the MS Stubs across as is in the first instance?

I’m looking forward to that (especially if that is done sooner rather than later)! Maybe the question of how to annotate __getitem__ should be discussed at the MS stubs instead of here or when trying to consolidate the copied and inline annotations.

I’ve spent a fair bit of time improving the annotations in the MS stubs, and developing out the test framework there. In particular, I just created a huge PR to make things work right with mypy. Once that is accepted (possibly in the next few days), the next steps would be to first bring over the testing mechanism and integrate it into the CI, and then bring over the actual stubs themselves.

We’d have to figure out a way to manage the transition from MS publishing the MS stubs in pylance (and Visual Studio Code) releases (and no longer maintaining them) to us having a pandas release with pandas supported stubs. I think the easiest way to do this would be to have a separate project with these stubs that is a submodule to pandas and a submodule to the MS stubs (which have more stubs than just pandas). But that would require both teams to agree that is the (relatively short term) way of managing these stubs. I’m open to other suggestions.
