DISCUSSION: Add format parameter to .astype when converting to str dtype
I propose adding a string formatting option to .astype when converting to str dtype. I think it’s reasonable to expect that you can choose the string format when converting to a string dtype, as you’re basically freezing a representation of your series, and plain .astype(str) is often too crude for this.
This option would take the shape of a format parameter to .astype that takes a string and can only be used when converting to string dtype. This would lessen the reliance on .apply for converting non-strings to more complex strings, and make such conversions more readable (IMO) and maybe faster (as we’re avoiding .apply, which is slow, though I’m not too knowledgeable about such optimizations).
The current procedure for converting to a complex string goes like this:
In [1]: ser = pd.Series([-1, 1.234])
In [2]: ser.apply("{:+.1f} $".format)
0    -1.0 $
1    +1.2 $
dtype: object
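For reference, the current idiom can be run as a standalone script. This is a minimal sketch; the list-comprehension variant is only an equivalent spelling added for comparison and is not part of the original proposal.

```python
import pandas as pd

ser = pd.Series([-1, 1.234])  # mixed int/float input is stored as float64

# Current idiom: pass the bound str.format method to .apply
formatted = ser.apply("{:+.1f} $".format)

# Equivalent spelling with a plain list comprehension (for comparison only)
formatted_alt = pd.Series(
    ["{:+.1f} $".format(v) for v in ser], index=ser.index
)

print(formatted.tolist())
```

Both variants call the format string once per element in Python, so neither is vectorized.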
I propose to make this possible:
In [3]: ser.astype(str, format="{:+.1f} $")
0    -1.0 $
1    +1.2 $
dtype: object
If the dtype parameter is not str, setting the format parameter should raise an exception. If format is not set, the current behaviour is used. The proposed change is therefore backward compatible.
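The rules above can be sketched with a free function. Note that astype_with_format is a hypothetical helper written for this illustration, not a pandas API, and that ValueError is an assumed choice, since the proposal only says "an exception".

```python
import pandas as pd


def astype_with_format(ser, dtype, format=None):
    """Hypothetical stand-in for the proposed ser.astype(dtype, format=...)."""
    if format is None:
        # No format given: fall back to today's behaviour
        return ser.astype(dtype)
    if dtype is not str:
        # Assumed exception type; the proposal does not name one
        raise ValueError("format can only be used when converting to str")
    # Apply the format string to each element
    return ser.map(format.format)


ser = pd.Series([-1, 1.234])
print(astype_with_format(ser, str, format="{:+.1f} $").tolist())
```

Because the helper delegates to .astype whenever format is omitted, existing calls would keep their current results.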
Also to consider:
Allowing a placeholder name
Should a placeholder name be available? Then you could do:
In [4]: ser = pd.Series(pd.date_range('2017-03', periods=2, freq='M'))
In [x]: ser.astype(str, format="Y{value.dt.year}-Q{value.dt.quarter}")
0    Y2017-Q1
1    Y2017-Q2
dtype: object
(Note that above we have an implicit parameter on .astype with the default value “value”, so adding a placeholder name is transparent. Note also that the above behaviour is already available through ser.dt.strftime, but please look at the principle rather than the concrete example.)
A downside to allowing a placeholder name could be the potential for abuse (stuffing too much into the format string) and possibly losing the option to vectorize (though this is not my area of expertise).
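A named placeholder could be emulated today roughly as follows. This is a sketch under assumptions: format_values and its name parameter are hypothetical, and the template uses value.year rather than value.dt.year because the format string is applied to each Timestamp element, whereas .dt is a Series accessor.

```python
import pandas as pd


def format_values(ser, template, name="value"):
    """Hypothetical helper: format each element under a placeholder name."""
    return ser.map(lambda v: template.format(**{name: v}))


# Month-end dates matching the example above, written out explicitly
dates = pd.Series(pd.to_datetime(["2017-03-31", "2017-04-30"]))

labels = format_values(dates, "Y{value.year}-Q{value.quarter}")
print(labels.tolist())
```

Attribute lookups inside the template run in Python for every element, which illustrates the vectorization concern raised above.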
Adding a .format method
Adding a .str.format or .format method to DataFrame/Series could also be considered.
If .format is added to the .str namespace it would only be usable for string dataframes/series (which I’d be quite ok with, if the format parameter is also available on .astype for other data types).
Alternatively, such a method could be available directly on all DataFrames/Series. Then you’d do ser.format('{:+.1f}') rather than ser.astype(str, format='{:+.1f}'). IMO though, it would be inconsistent to have such a string conversion method directly on pandas objects, but not for other types. Why have .format but not .to_numeric as a dataframe/series method?
IMO therefore, astype(str, format=...) combined with a .str.format method is better than adding a new .format method for this. So:
- .astype(str, format=...) makes it very obvious that we’re now changing to string datatype, and
- .str.format(...) makes it clear that we’re doing a string manipulation.
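For anyone who wants to try out the standalone-method variant discussed above, pandas allows registering custom accessors. The sketch below uses pd.api.extensions.register_series_accessor; the accessor name fmt and the class FormatAccessor are hypothetical choices for this illustration, not something pandas provides.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("fmt")  # "fmt" is a made-up name
class FormatAccessor:
    """Hypothetical accessor that makes ser.fmt(template) available."""

    def __init__(self, ser):
        self._ser = ser

    def __call__(self, template):
        # Apply the format string to each element
        return self._ser.map(template.format)


ser = pd.Series([-1, 1.234])
result = ser.fmt("{:+.1f}")
print(result.tolist())
```

This only demonstrates the call shape; it does not address the consistency concern about having such a method on pandas objects.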
Issue Analytics
- Created 6 years ago
- Reactions: 6
- Comments: 19 (18 by maintainers)
Top GitHub Comments
OK, given the discussion we are having on #18347, I’m more amenable to this.
This is something completely different. This converts the full dataframe to a string representation, while here it is about converting values to formatted string values inside a dataframe.
I like having some way to do this (but the question is indeed what kind of API), but I would also be OK with ending the discussion by deciding that it is not important enough to add specialized functionality, and that the s.apply("{..} ..".format) idiom is the recommended way here. But let’s at least have that discussion.