ENH: allow lists/sets in fillna
The docs for the value parameter of fillna (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna) say:
value : scalar, dict, Series, or DataFrame. Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (Values not in the dict/Series/DataFrame will not be filled). **This value cannot be a list.** [my bold]
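For reference, a quick sketch of my own (values invented) of the forms that are accepted today, a scalar or an index-aligned dict/Series:
import pandas as pd

s = pd.Series([1.0, None, None])
s.fillna(0)                     # scalar: every NaN becomes 0
s.fillna({1: 10.0})             # dict/Series: fill by matching index label
# a list as the fill value, however, is rejected outright (reproduced below)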
Frankly, I do not understand this limitation, especially because I see no way in which it could be interpreted ambiguously. There are several use cases (that I keep encountering) for filling in lists, especially empty lists.
One of the main ones is using .str.split(..., expand=False) or str.extract and then wanting to keep processing the lists, e.g. turning them into sets:
import pandas as pd

s = pd.Series(['a,b,b,c', 'b,c,d,d,d', 1, None]).str.split(',')
s
# 0 [a, b, b, c]
# 1 [b, c, d, d, d]
# 2 NaN
# 3 NaN
# dtype: object
s.map(set) # this errors on NaNs
# TypeError: 'float' object is not iterable
### would like to use:
s.fillna([]).map(set)
# TypeError: "value" parameter must be a scalar or dict, but you passed a "list"
### same for
s.fillna(set()).map(set)
# TypeError: "value" parameter must be a scalar or dict, but you passed a "set"
It’s tedious to always do this by hand (esp. for a DataFrame), like
serieswithalongname.loc[serieswithalongname.isnull()] = [] # can't be chained either
### even this doesn't work, because pd.notnull is applied element-wise to a list -> ambiguous boolean
s.map(lambda x: set(x) if pd.notnull(x) else set())
# ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
### meaning the next step gets even more unwieldy:
s.map(lambda x: set() if isinstance(x, (float, type(None))) and pd.isnull(x) else set(x))
# 0 {c, a, b}
# 1 {c, d, b}
# 2 {}
# 3 {}
# dtype: object
### work-around even more painful for DataFrames,
### because single-value broadcast doesn't work as before
df = s.to_frame('A')
df.loc[df.A.isnull(), 'A'] = []
# ValueError: cannot copy sequence with size 0 to array axis with dimension 1
### need to know to use this:
df.loc[df.A.isnull(), 'A'] = [[]]
df.applymap(set)
# A
# 0 {c, a, b}
# 1 {c, d, b}
# 2 {}
# 3 {}
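For completeness, a chainable sketch of the same workaround (my own code; fillna_empty_list is just a name I made up, not a pandas function) that avoids the .loc assignment and the [[]] broadcasting trick:
import pandas as pd

def fillna_empty_list(ser):
    # hypothetical helper: keep existing lists, replace anything else (the NaNs) with []
    return ser.apply(lambda x: x if isinstance(x, list) else [])

s = pd.Series(['a,b,b,c', 'b,c,d,d,d', 1, None]).str.split(',')
s.pipe(fillna_empty_list).map(set)         # rows 2 and 3 become empty sets

df = s.to_frame('A')
df.apply(fillna_empty_list).applymap(set)  # same idea column-wise, no .loc needed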
This is also somewhat related to #19266.
@TomAugspurger Thanks for the input. I think this ambiguity is not so serious. First off, self[i] for i in range(len(value)) does not work for a DataFrame, but more importantly, one can immediately see the effect that the whole list gets filled into every NaN. And with a quick look at the docs, it’s clear that a dictionary is needed to distinguish fill-values by column. Furthermore, since lists currently throw errors, allowing this would not break any existing code. I think it would be very useful…
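To make the dict point concrete, a minimal example of my own (column names and values invented): per-column fill values already go through a dict today, so a bare list would be free to mean "fill every hole with this one object":
import pandas as pd

df = pd.DataFrame({'A': [1.0, None], 'B': [None, 2.0]})
df.fillna({'A': 0, 'B': -1})   # per-column fill values already require a dict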
I often have to deal with data where a table column contains a string with comma-separated values or NAs. So having a fix for this issue would be highly appreciated!
I painfully realized that pd.Series(['a,b,b,c', None]).str.split(',').fillna([]) throws an error, and that pd.Series(['a,b,b,c', None]).fillna('').str.split(',') returns a list containing an empty string rather than an empty list. 😦
So I ended up with pd.Series(['a,b,b,c', None]).str.split(',').map(lambda x: x if isinstance(x, list) else []), which is not pretty but does the job.
I don’t know if this has decent performance, and I don’t really care, because all I need is to declare several data transformations in a chained fashion, where each transformation can be easily expressed on a single line (without defining several helper functions or anything like that).
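For what it’s worth, a sketch of what that chained style looks like end to end (my own example; the column name and the final set step are purely illustrative):
import pandas as pd

df = pd.DataFrame({'tags': ['a,b,b,c', 'b,c,d,d,d', None]})

result = (
    df['tags']
    .str.split(',')
    .map(lambda x: x if isinstance(x, list) else [])  # stand-in for the missing fillna([])
    .map(set)
)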