Consider using polars instead of pandas.
For a faster experience than pandas, polars is a good option. It already has a much nicer syntax than pandas and a very fast groupby (and other operations).
Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model.
- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
- Rust | Python | …
To learn more, read the User Guide.
>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )
# embarrassingly parallel execution
# very expressive query language
>>> (
...     df
...     .sort("fruits")
...     .select(
...         [
...             "fruits",
...             "cars",
...             pl.lit("fruits").alias("literal_string_fruits"),
...             pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...             pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),  # groups by "cars"
...             pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),  # groups by "fruits"
...             pl.col("A").reverse().over("fruits").flatten().alias("rev_A_by_fruits"),  # groups by "fruits"
...             pl.col("A").sort_by("B").over("fruits").flatten().alias("sort_A_by_B_by_fruits"),  # groups by "fruits"
...         ]
...     )
... )
shape: (5, 8)
ββββββββββββ¬βββββββββββ¬βββββββββββββββ¬ββββββ¬ββββββββββββββ¬ββββββββββββββ¬ββββββββββββββ¬ββββββββββββββ
β fruits β cars β literal_stri β B β sum_A_by_ca β sum_A_by_fr β rev_A_by_fr β sort_A_by_B β
β --- β --- β ng_fruits β --- β rs β uits β uits β _by_fruits β
β str β str β --- β i64 β --- β --- β --- β --- β
β β β str β β i64 β i64 β i64 β i64 β
ββββββββββββͺβββββββββββͺβββββββββββββββͺββββββͺββββββββββββββͺββββββββββββββͺββββββββββββββͺββββββββββββββ‘
β "apple" β "beetle" β "fruits" β 11 β 4 β 7 β 4 β 4 β
ββββββββββββΌβββββββββββΌβββββββββββββββΌββββββΌββββββββββββββΌββββββββββββββΌββββββββββββββΌββββββββββββββ€
β "apple" β "beetle" β "fruits" β 11 β 4 β 7 β 3 β 3 β
ββββββββββββΌβββββββββββΌβββββββββββββββΌββββββΌββββββββββββββΌββββββββββββββΌββββββββββββββΌββββββββββββββ€
β "banana" β "beetle" β "fruits" β 11 β 4 β 8 β 5 β 5 β
ββββββββββββΌβββββββββββΌβββββββββββββββΌββββββΌββββββββββββββΌββββββββββββββΌββββββββββββββΌββββββββββββββ€
β "banana" β "audi" β "fruits" β 11 β 2 β 8 β 2 β 2 β
ββββββββββββΌβββββββββββΌβββββββββββββββΌββββββΌββββββββββββββΌββββββββββββββΌββββββββββββββΌββββββββββββββ€
β "banana" β "beetle" β "fruits" β 11 β 4 β 8 β 1 β 1 β
ββββββββββββ΄βββββββββββ΄βββββββββββββββ΄ββββββ΄ββββββββββββββ΄ββββββββββββββ΄ββββββββββββββ΄ββββββββββββββ
https://github.com/pola-rs/polars/
For more R-like function names, take a look at tidypolars:
tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users. https://github.com/markfairbanks/tidypolars
tidypolars benchmark:
┌─────────────┬────────────┬─────────┬──────────┐
│ func_tested ┆ tidypolars ┆ polars  ┆ pandas   │
│ ---         ┆ ---        ┆ ---     ┆ ---      │
│ str         ┆ f64        ┆ f64     ┆ f64      │
╞═════════════╪════════════╪═════════╪══════════╡
│ arrange     ┆ 190.345    ┆ 169.478 ┆ 500.112  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ case_when   ┆ 87.348     ┆ 79.427  ┆ 152.623  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ distinct    ┆ 16.888     ┆ 16.282  ┆ 28.725   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ filter      ┆ 29.789     ┆ 29.91   ┆ 231.397  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ full_join   ┆ 236.784    ┆ 231.283 ┆ 1042.689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ inner_join  ┆ 49.71      ┆ 47.563  ┆ 630.98   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ left_join   ┆ 113.792    ┆ 115     ┆ 1100.607 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ mutate      ┆ 7.979      ┆ 7.408   ┆ 117.283  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ pivot_wider ┆ 42.764     ┆ 39.939  ┆ 49.048   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ summarize   ┆ 59.434     ┆ 58.011  ┆ 453.707  │
└─────────────┴────────────┴─────────┴──────────┘
Issue Analytics
- State:
- Created 2 years ago
- Reactions:3
- Comments:13 (7 by maintainers)
Top GitHub Comments
Just a quick note: had a nice discussion with @has2k1 (author of https://github.com/has2k1/plotnine), and he emphasized interest / the value of integrating with polars. We've set up some time to pair and try it out a bit more.
Background
Just a quick update from kicking the tires. I think polars is very similar to SQL / pandas, where essentially:
For context, dplyr is a bit funky in that its group_by basically sets the grouping for both cases above.
Here are some resources if anyone is curious about how dplyr's grammar / approach works:
Example
So in this case you can do the fast filter this way…
A huge benefit of polars here is that, like SQL engines, it can optimize queries using e.g. predicate pushdown. I think similar to pandas, deeply custom groupby + apply will be much slower than dplyr (and dplyr will be slower than non-custom groupby stuff in polars, etc…).
I don't really know what will give people the same level of arbitrary groupby + apply as R, but am super interested / curious to explore polars (also relieved to not have indexes in the user-facing API).