Allow columns to "pass through" summarize? (e.g. across)

I have another question whose answer could really streamline the way I work: I often perform calculations on aggregates and would like some features (that are constant within the group) to pass through the summarize. I know I can create new variables, e.g. summarize(new_col=_.feature.mean(), old_col=_.old_col.iloc[0]), but this gets tedious when there are many columns (or even just a few).
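As a point of comparison, here is a minimal sketch of two pass-through workarounds in plain pandas (not siuba). The DataFrame, its column names, and the `passthrough` list are made up for illustration; the only assumption is that `region` is constant within each `group`.

```python
import pandas as pd

# Hypothetical data: "region" is constant within each "group".
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "region": ["north", "north", "south", "south"],
    "feature": [1.0, 3.0, 10.0, 20.0],
})

# Option 1: add the constant columns to the grouping keys so they survive.
by_keys = df.groupby(["group", "region"], as_index=False).agg(
    new_col=("feature", "mean")
)

# Option 2: build the "take the first value" aggregations programmatically,
# so the list of pass-through columns is written only once.
passthrough = ["region"]
spec = {col: (col, "first") for col in passthrough}
spec["new_col"] = ("feature", "mean")
by_first = df.groupby("group", as_index=False).agg(**spec)

print(by_keys)
print(by_first)
```

Option 1 is shorter, but it changes the grouping if a "constant" column turns out not to be constant; option 2 keeps the original grouping and silently takes the first value.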

Is there a way to tell siuba (more specifically, summarize) to pass through some variables? On a related note, is there a way to apply the same operation to many columns without using gather? (Currently I go through gather -> group_by -> summarize -> spread to operate on many similar columns.)
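For the second question, a rough sketch of the same idea in plain pandas (again, not siuba): selecting several columns on the grouped object applies one aggregation to all of them without reshaping. The data and the `value_cols` list are hypothetical.

```python
import pandas as pd

# Hypothetical data with two numeric columns that get the same treatment.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 5.0, 7.0],
    "y": [10.0, 30.0, 50.0, 70.0],
    "label": ["p", "q", "r", "s"],
})

# Select the columns once, then apply a single aggregation to all of them.
value_cols = ["x", "y"]
means = df.groupby("group", as_index=False)[value_cols].mean()

print(means)
```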

Thanks for the awesome library! Omri

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

4 reactions
machow commented, Jun 21, 2021

Hey, sorry for the delay. I’ve been thinking about how across could be implemented. It seems like, similar to siuba’s implementation of case_when(), across() could take data as its first argument (verbs such as select or mutate do this too).

Here’s a case_when example (since apparently it is undocumented 😬).

from siuba.data import mtcars
from siuba import _, case_when

# outputs numpy array: array(['> 4 cyl', '> 4 cyl', 'other', ...])
case_when(mtcars, {_.cyl > 4: "> 4 cyl", True: "other"})

# outputs a Symbolic expression
case_when(_, {}) 

# note that case_when works in SQL backends too!
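For readers unfamiliar with case_when, the call on a DataFrame behaves roughly like NumPy's `np.select`. The sketch below is an analogy rather than siuba's actual implementation, and the three `cyl` values are a hand-written sample, not the full mtcars column.

```python
import numpy as np

# Hand-written sample of cylinder counts (an assumption for illustration).
cyl = np.array([6, 6, 4])

# Each condition is paired with a label; rows matching no condition
# fall through to the default, like the `True: "other"` entry above.
labels = np.select([cyl > 4], ["> 4 cyl"], default="other")

print(labels)
```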

Across proposal

Essentially what could happen is:

  1. across takes data as its first argument (e.g. across(_, ...), across(mtcars, ...))
     • its other args are like dplyr: column selection, functions to apply, etc.
     • it returns a DataFrame(GroupBy)
  2. across(_, _.contains('abc'), _.mean(), ...) within verbs will just get evaluated like other symbolic calls
     • down the road we can probably omit the first _
  3. functions like mutate and summarize will need to be able to handle when evaluated arguments are DataFrame(GroupBy). This would be super handy to add anyway!
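To make the proposal concrete, here is a small pandas-only sketch of the behaviour described in step 1: take data first, select columns, apply a function, and return a DataFrame. `across_sketch` is a hypothetical helper written for this illustration; it is not siuba's API and it skips symbolic expressions entirely.

```python
import pandas as pd

def across_sketch(data, cols, fn):
    """Hypothetical stand-in for the proposed across().

    `data` may be a DataFrame or a DataFrameGroupBy; `cols` is a list of
    column names; `fn` is anything accepted by `.agg` (e.g. "mean").
    """
    result = data[cols].agg(fn)
    # An ungrouped reduction yields a Series; widen it to a one-row
    # DataFrame so the return type is always a DataFrame.
    if isinstance(result, pd.Series):
        result = result.to_frame().T
    return result

# Small made-up table loosely shaped like mtcars.
cars = pd.DataFrame({
    "cyl": [6, 4, 6],
    "mpg": [21.0, 22.8, 21.4],
    "hp": [110, 93, 110],
})

ungrouped = across_sketch(cars, ["mpg", "hp"], "mean")
grouped = across_sketch(cars.groupby("cyl"), ["mpg", "hp"], "mean")

print(ungrouped)
print(grouped)
```

Because the helper always returns a DataFrame, a verb such as summarize would only need one extra rule: when an evaluated argument is a DataFrame, splice its columns into the result.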

Examples

from siuba.data import mtcars
# across is the function proposed above
from siuba import _, across, mutate, summarize

# all the classic selection options
across(mtcars, [_.mpg, _.hp], _.mean())
across(mtcars, _.contains("mpg"), _.mean())
across(_, _[3:5], _.mean())

# with summarize
mtcars >> summarize(across(_, [_.mpg, _.hp], _.mean()))

# an implication of summarize accepting DataFrames as arguments is
# that you can do this.
mtcars >> summarize(_[["mpg", "hp"]].mean())

# or this, which would override mpg and hp to be + 1
mtcars >> mutate(_[["mpg", "hp"]] + 1)

# here is the analogous dplyr code (R, not Python). TODO: is the overriding
# behavior mentioned explicitly in the dplyr docs?
# mutate(mtcars, mtcars[c("hp", "mpg")] + 1)

0 reactions
moeketsims commented, Dec 6, 2021

Does siuba have an across verb?
