Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Consider using polars instead of pandas.

See original GitHub issue

For a faster experience than pandas, polars is a good option. It already has a much better syntax than Pandas and has very fast groubpy (and other operations).

Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model.

Lazy | eager execution

Multi-threaded

SIMD

Query optimization

Powerful expression API

Rust | Python | …

To learn more, read the User Guide.



>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )

# embarrassingly parallel execution
# very expressive query language
>>> (
...     df
...     .sort("fruits")
...     .select(
...         [
...             "fruits",
...             "cars",
...             pl.lit("fruits").alias("literal_string_fruits"),
...             pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...             pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),     # groups by "cars"
...             pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),                         # groups by "fruits"
...             pl.col("A").reverse().over("fruits").flatten().alias("rev_A_by_fruits"),           # groups by "fruits
...             pl.col("A").sort_by("B").over("fruits").flatten().alias("sort_A_by_B_by_fruits"),  # groups by "fruits"
...         ]
...     )
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘

https://github.com/pola-rs/polars/

For a more R like function names, you can take a look at tidypolars:

tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users. https://github.com/markfairbanks/tidypolars

tidypolars benchmark:

┌─────────────┬────────────┬─────────┬──────────┐
│ func_tested ┆ tidypolars ┆ polars  ┆ pandas   │
│ ---         ┆ ---        ┆ ---     ┆ ---      │
│ str         ┆ f64        ┆ f64     ┆ f64      │
╞═════════════╪════════════╪═════════╪══════════╡
│ arrange     ┆ 190.345    ┆ 169.478 ┆ 500.112  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ case_when   ┆ 87.348     ┆ 79.427  ┆ 152.623  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ distinct    ┆ 16.888     ┆ 16.282  ┆ 28.725   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ filter      ┆ 29.789     ┆ 29.91   ┆ 231.397  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ full_join   ┆ 236.784    ┆ 231.283 ┆ 1042.689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ inner_join  ┆ 49.71      ┆ 47.563  ┆ 630.98   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ left_join   ┆ 113.792    ┆ 115     ┆ 1100.607 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ mutate      ┆ 7.979      ┆ 7.408   ┆ 117.283  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ pivot_wider ┆ 42.764     ┆ 39.939  ┆ 49.048   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ summarize   ┆ 59.434     ┆ 58.011  ┆ 453.707  │
└─────────────┴────────────┴─────────┴──────────┘

Issue Analytics

State:
Created 2 years ago
Reactions:3
Comments:13 (7 by maintainers)

Top GitHub Comments

3reactions

machowcommented, Jun 1, 2022

Just a quick note–had a nice discussion with @has2k1 (author of https://github.com/has2k1/plotnine), and he emphasized interest / the value of integrating with polars. We’ve set up some time to pair and try it out a bit more 😃

3reactions

machowcommented, Jan 20, 2022

Background

Just a quick update from kicking the tires. I think polars is very similar to SQL / pandas, where essentially:

group by is creates an aggregation context (e.g. everything that happens should be an agg)
over is used to do grouping inside filters and selects (what dplyr calls mutate)

For context, dplyr is a bit funky in that its group_by basically sets the grouping for both cases above.

Here are some resources if anyone is curious about how dplyr’s grammar / approach works:

a rstudio global talk I did comparing dplyr grammar to pandas (20 mins)
Alex Hayes many models blog post - making each row of a series a model, then turning each model into a DataFrame, then unnesting DataFrames.
Dave Robinson’s 10 tremendous tricks of the tidyverse (20 mins). Highlights key moves from a year of live data analysis.

Example

So in this case you can do the fast filter this way…

# setup
import numpy as np
import pandas as pd

np.random.seed(123)
df_students = pd.DataFrame({
    'student_id': np.repeat(np.arange(2000), 10),
    'course_id': np.random.randint(1, 20, 20000),
    'score': np.random.randint(1, 100, 20000)
})

# siuba ----
from siuba import _, group_by
from siuba.experimental.pd_groups import fast_filter

df_students >> group_by(_.student_id) >> fast_filter(_.score == _.score.min())


# pandas ----
df_students.loc[
    lambda d: d["score"] == d.groupby("student_id")["score"].transform("min")
]


# polars ----
import polars as pl
pl_students = pl.from_pandas(df_students)
pl_students.filter(pl.col("score") == pl.col("score").min().over("student_id"))

A huge benefit of polars here is that–like SQL engines–it can optimize queries using e.g. predicate pushdown. I think similar to pandas, deeply custom groupby + apply will be much slower than dplyr (and dplyr will be slower than non custom groupby stuff in polars, etc…).

I don’t really know what will give people the same level of arbitrary groupby + apply as R, but am super interested / curious to explore polars (also relieved to not have indexes in the user facing api 😃.