Pandas pivot_table with MultiIndex columns and dropna=False generates all combinations of level values instead of keeping only the existing ones
Minimal Verifiable Working Example
Below is a Minimal Verifiable Working Example (MVWE) that reproduces the behaviour described in this issue:
import pandas as pd
# JSON Dump for MWVE:
txt = """[{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1515024000000,"userfloatval
ue":null}]"""
# Load Data:
df = pd.read_json(txt)
# Filling NaN with a string works as expected but changes the value columns to object dtype:
cross2 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", fill_value="nodata")
# Trying to pivot data using MultiIndex and keeping columns of NaN produces all combinations of modalities:
cross3 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", dropna=False)
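To quantify the blow-up, one can compare the number of combinations observed in the data with the number of columns that dropna=False produces. A self-contained sketch with the same structure as the JSON dump above (inline data instead of the dump):

```python
import pandas as pd

# Same structure as the MVWE: 4 channels, each with a unique
# (networkkey, sitekey, measurandkey) combination, 4 timestamps, all-NaN values.
rows = [
    (cid, "HMT", site, meas, ts, float("nan"))
    for cid, site, meas in [(5069, "01MEU1", "Cu"), (5119, "01AND3", "Cu"),
                            (5120, "01MEU1", "Pb"), (5233, "01AND3", "Pb")]
    for ts in pd.date_range("2018-01-01", periods=4)
]
df = pd.DataFrame(rows, columns=["channelid", "networkkey", "sitekey",
                                 "measurandkey", "timestamp", "userfloatvalue"])

cross3 = df.pivot_table(index="timestamp",
                        columns=["channelid", "networkkey", "sitekey", "measurandkey"],
                        values="userfloatvalue", aggfunc="first", dropna=False)

# Only 4 combinations exist in the data...
observed = df[["channelid", "networkkey", "sitekey", "measurandkey"]].drop_duplicates()
print(len(observed))    # 4

# ...but dropna=False reindexes to the cartesian product of the levels:
print(cross3.shape)     # (4, 16), i.e. 4 * 1 * 2 * 2 columns
```

With 25 channels and more sites/measurands, the same product grows multiplicatively, which matches the MemoryError described below.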
Trial input (df) looks like:
 | channelid | measurandkey | networkkey | sitekey | timestamp | userfloatvalue
---|---|---|---|---|---|---
0 | 5069 | Cu | HMT | 01MEU1 | 2018-01-01 | NaN |
1 | 5069 | Cu | HMT | 01MEU1 | 2018-01-02 | NaN |
2 | 5069 | Cu | HMT | 01MEU1 | 2018-01-03 | NaN |
3 | 5069 | Cu | HMT | 01MEU1 | 2018-01-04 | NaN |
4 | 5119 | Cu | HMT | 01AND3 | 2018-01-01 | NaN |
5 | 5119 | Cu | HMT | 01AND3 | 2018-01-02 | NaN |
6 | 5119 | Cu | HMT | 01AND3 | 2018-01-03 | NaN |
7 | 5119 | Cu | HMT | 01AND3 | 2018-01-04 | NaN |
8 | 5120 | Pb | HMT | 01MEU1 | 2018-01-01 | NaN |
9 | 5120 | Pb | HMT | 01MEU1 | 2018-01-02 | NaN |
10 | 5120 | Pb | HMT | 01MEU1 | 2018-01-03 | NaN |
11 | 5120 | Pb | HMT | 01MEU1 | 2018-01-04 | NaN |
12 | 5233 | Pb | HMT | 01AND3 | 2018-01-01 | NaN |
13 | 5233 | Pb | HMT | 01AND3 | 2018-01-02 | NaN |
14 | 5233 | Pb | HMT | 01AND3 | 2018-01-03 | NaN |
15 | 5233 | Pb | HMT | 01AND3 | 2018-01-04 | NaN |
Misbehaved output (cross3) looks like:
channelid | 5069 | 5119 | 5120 | 5233 | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
networkkey | HMT | HMT | HMT | HMT | ||||||||||||
sitekey | 01AND3 | 01MEU1 | 01AND3 | 01MEU1 | 01AND3 | 01MEU1 | 01AND3 | 01MEU1 | ||||||||
measurandkey | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb | Cu | Pb |
timestamp | ||||||||||||||||
2018-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2018-01-02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2018-01-03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2018-01-04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Expected output is similar to cross2 but with NaN values instead of the string, and looks like:
channelid | 5069 | 5119 | 5120 | 5233 |
---|---|---|---|---|
networkkey | HMT | HMT | HMT | HMT |
sitekey | 01MEU1 | 01AND3 | 01MEU1 | 01AND3 |
measurandkey | Cu | Cu | Pb | Pb |
timestamp | ||||
2018-01-01 | nodata | nodata | nodata | nodata |
2018-01-02 | nodata | nodata | nodata | nodata |
2018-01-03 | nodata | nodata | nodata | nodata |
2018-01-04 | nodata | nodata | nodata | nodata |
Problem description
I need:
- to use a MultiIndex in columns even if it is overdetermined (that is, the index is still unique with fewer levels) and;
- to keep columns full of NaN, because an all-NaN column means the channel lacks all its data.
The problem seems to be the creation of all combinations of level values (instead of keeping only the existing ones), which drastically increases memory usage without necessity (those combinations are not present in the original data).
Maybe it is a bug, maybe it is the designed behaviour. I just wanted to report it because it surprised me, and I am now looking for a clean way to work around this behaviour.
How I found it:
I first hit a MemoryError with small queries (about 1000 rows and 25 channels), then I reduced the number of rows and columns, and finally dumped the data to JSON to build the MVWE above.
Expected Output
To my understanding, the following command:
cross3 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", dropna=False)
should return the same as:
cross2 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", fill_value="nodata")
cross2bis = cross2.replace('nodata', float('nan'))
That is, a small DataFrame with no extra columns and the all-NaN columns not dropped; it should look like:
channelid | 5069 | 5119 | 5120 | 5233 |
---|---|---|---|---|
networkkey | HMT | HMT | HMT | HMT |
sitekey | 01MEU1 | 01AND3 | 01MEU1 | 01AND3 |
measurandkey | Cu | Cu | Pb | Pb |
timestamp | ||||
2018-01-01 | NaN | NaN | NaN | NaN |
2018-01-02 | NaN | NaN | NaN | NaN |
2018-01-03 | NaN | NaN | NaN | NaN |
2018-01-04 | NaN | NaN | NaN | NaN |
Without generating combinations of level values that do not exist in the input data.
This would also prevent raising a MemoryError for reasonable amounts of data.
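A side note on the cross2/replace workaround above (my observation, not from the thread): filling with a string sentinel leaves the value columns as object dtype, and replace's own downcasting behaviour varies across pandas versions, so an explicit numeric conversion makes the restored dtype deterministic. A minimal sketch:

```python
import pandas as pd

# A column filled with a string sentinel has object dtype; after replacing the
# sentinel with NaN, an explicit conversion guarantees a float column back.
s = pd.Series(["nodata", "nodata", 1.5], dtype=object)
restored = pd.to_numeric(s.replace("nodata", float("nan")))
print(restored.dtype)  # float64
```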
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 4
- Comments: 13 (4 by maintainers)
Top GitHub Comments
The docs at /stable are a bit out of date. Try http://pandas-docs.github.io/pandas-docs-travis/development/contributing.html
@jlandercy thanks for diving in! My first reflex was to say that this is due to limitations of unstack dealing with existing NaNs vs NaNs introduced by unstacking (you can search for ‘unstack dropna’ in the issues to see some related discussions). But actually, it seems you can get your desired result (I think) with the underlying groupby + unstack:
Is it correct that this is what you want?
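The code sample from this comment did not survive the page extraction. A sketch of the groupby + unstack approach it refers to (my reconstruction, not the maintainer's exact snippet): unstacking several levels at once only materialises the level combinations actually observed in the index, so the all-NaN columns are kept without the cartesian blow-up.

```python
import pandas as pd

# Toy data with the MVWE's structure: 4 channels, 4 timestamps, all-NaN values.
rows = [
    (cid, "HMT", site, meas, ts, float("nan"))
    for cid, site, meas in [(5069, "01MEU1", "Cu"), (5119, "01AND3", "Cu"),
                            (5120, "01MEU1", "Pb"), (5233, "01AND3", "Pb")]
    for ts in pd.date_range("2018-01-01", periods=4)
]
df = pd.DataFrame(rows, columns=["channelid", "networkkey", "sitekey",
                                 "measurandkey", "timestamp", "userfloatvalue"])

# groupby materialises only observed key combinations, and unstacking the
# four column levels together keeps it that way:
out = (df.groupby(["timestamp", "channelid", "networkkey", "sitekey", "measurandkey"])
         ["userfloatvalue"]
         .first()
         .unstack(["channelid", "networkkey", "sitekey", "measurandkey"]))
print(out.shape)  # (4, 4): the 4 observed combinations, all-NaN columns kept
```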