ux: problems with ibis' `_` convenience API for deferred attribute resolution
#3804 introduced a nice way to chain several operations that use new columns not originally part of the input expression:
import pandas as pd
import ibis
from ibis import _
table = ibis.memtable(pd.DataFrame({"x": list("12213")}))
(
    table.mutate(a=2 * _.x, b=_.x.cast("float64"))
    .group_by(_.a)
    .aggregate(count=_.b.count(), mean_b=_.b.mean())
    .order_by(_.mean_b)
).execute()
This makes it very easy to put complex logic in functions that add or rename columns and, in particular, to chain them with the .pipe method.
The choice of _ as the name of the importable deferred attribute resolver is nice because it leads to quite concise and readable code. However, I also find it problematic from a UX point of view for the following reasons:
a) it’s not easily googleable: someone not familiar with this idiom will have a hard time googling or stackoverflowing for it.
b) it can conflict with the _ variable of Jupyter notebooks (which stores the value of the last executed cell) and can therefore lead to very confusing error messages for unsuspecting users who do exploratory data analysis with Ibis in a Jupyter notebook with multiple cells.
c) even in regular, non-interactive Python code it can conflict with the common idiom of assigning ignored function call results to a dummy _ variable to express that we do not need a variable for an ancillary value. E.g.:
a, _ = function_returning_a_pair_of_values()  # ignore the second value

for _ in range(n_trials):
    if attempt():
        break
else:
    raise Exception(f"{n_trials} consecutive failed attempts")
Possible solutions
- A solution for a) only could be to change the default name of the singleton, while still allowing users to keep the _ idiom in their code:
from ibis import deferred_resolver as _
# continue as previously
At least people reading this code for the first time will understand that _ is some kind of deferred attribute resolver just by reading the code: no complex googling or random documentation scanning.
I don’t have good solutions for b) and c). Here are ideas:
Suggest (by convention) to name the resolver c instead of _
I picked c for “column” because most of the time the attribute lookup matches a column lookup. However, a variable named c might still frequently conflict with user code. Adapting the original example would lead to code that looks like:
import pandas as pd
import ibis
from ibis import deferred_resolver as c
table = ibis.memtable(pd.DataFrame({"x": list("12213")}))
(
    table.mutate(a=2 * c.x, b=c.x.cast("float64"))
    .group_by(c.a)
    .aggregate(count=c.b.count(), mean_b=c.b.mean())
    .order_by(c.mean_b)
).execute()
or we could be even more creative with the 𓅠 unicode symbol for the ibis (bird) hieroglyph:
import pandas as pd
import ibis
from ibis import deferred_resolver as 𓅠
table = ibis.memtable(pd.DataFrame({"x": list("12213")}))
(
    table.mutate(a=2 * 𓅠.x, b=𓅠.x.cast("float64"))
    .group_by(𓅠.a)
    .aggregate(count=𓅠.b.count(), mean_b=𓅠.b.mean())
    .order_by(𓅠.mean_b)
).execute()
The latter suggestion is more like a joke because:
- it can break code editors / readers that do not render non-ascii unicode symbols properly;
- it’s cumbersome to type such code without setting some kind of OS-level custom keyboard mapping / user defined code snippets.
Issue Analytics
- Created a year ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
Hey @ogrisel, these are good points, thanks for bringing them up. I agree that _ is problematic for the reasons you mentioned. (Thanks also @jmckk for chiming in.)

To add some other possible approaches, let me say that we’ve been considering similar functionality for join expressions, which need to refer to both the left and the right tables being joined. The obvious choice here is something like L and R (or _L or L_); this might suggest an equivalent C (or X) as a replacement for _. AFAIK these single capital letter symbols wouldn’t conflict with any mainstream Python idioms or platforms.

This still doesn’t make them searchable, but we want to keep the identifier extremely short anyway; one or two characters at most, so maybe this is not possible. Although we should come up with a memorable and searchable name that is part of its docstring and mentioned everywhere we mention this feature.
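As a rough illustration of what L and R resolvers for joins could look like, here is a toy pure-Python sketch. Everything in it (`Side`, `ColumnRef`, `Equals`, `join`, and the sample rows) is hypothetical and invented for this example; it does not reflect any existing or planned Ibis API.

```python
# Toy illustration only: hypothetical names, not an Ibis API.
class ColumnRef:
    """A column name tagged with the side of the join it belongs to."""

    def __init__(self, side, name):
        self.side, self.name = side, name

    def __eq__(self, other):
        # Build a join predicate instead of comparing immediately.
        return Equals(self, other)


class Equals:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def matches(self, left_row, right_row):
        rows = {"left": left_row, "right": right_row}
        lhs = rows[self.left.side][self.left.name]
        rhs = rows[self.right.side][self.right.name]
        return lhs == rhs


class Side:
    """Deferred resolver bound to one side of a join."""

    def __init__(self, side):
        self._side = side

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return ColumnRef(self._side, name)


L, R = Side("left"), Side("right")


def join(left_rows, right_rows, predicate):
    # Naive nested-loop join over lists of dicts.
    return [
        {**lrow, **rrow}
        for lrow in left_rows
        for rrow in right_rows
        if predicate.matches(lrow, rrow)
    ]


users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "bob"}]
orders = [{"order_id": 10, "buyer": 2}, {"order_id": 11, "buyer": 2}]
joined = join(users, orders, L.user_id == R.buyer)
```

The capital letters read naturally in the predicate and do not collide with the throwaway `_` idiom, though they are no easier to search for.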
A Unicode character is clever but I agree, it’s a non-starter if it can’t be typed on a standard keyboard in every locale 😃
As to your side UX problem, in my opinion, the repr for an Ibis expression in general should generate an identical (or equivalent) Ibis expression string, like a quine. This would open up some really interesting use-cases, but is also a large amount of work. We might be able to do something more helpful in this smaller case, which may be easier and also push us in this more general direction.
Interesting. I suppose that the type hints will make this work even when no interactive Python shell backs the code editor (e.g. editing a .py file in VS Code, and not just for .ipynb files).

However, that won’t solve the problem of suggesting the right column names when typing a chained expression that uses previously added column names (e.g. via mutate or agg).