Pandas pivot_table MultiIndex and dropna=False generates all combinations of modalities instead of keeping existing one only

See original GitHub issue

Minimal Verifiable Working Example

Below you will find a Minimal Verifiable Working Example that reproduces the behaviour considered in this issue:

import pandas as pd
# JSON dump for the MVWE:
txt = """[{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5069,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Cu","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5119,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Cu","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5120,"networkkey":"HMT","sitekey":"01MEU1","measurandkey":"Pb","timestamp":1515024000000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514764800000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514851200000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1514937600000,"userfloatvalue":null},{"channelid":5233,"networkkey":"HMT","sitekey":"01AND3","measurandkey":"Pb","timestamp":1515024000000,"userfloatvalue":null}]"""
# Load Data:
df = pd.read_json(txt)
# Filling NaN with a string works as expected, but downcasts the column types:
cross2 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", fill_value="nodata")
# Trying to pivot data using MultiIndex and keeping columns of NaN produces all combinations of modalities:
cross3 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", dropna=False)

Trial input looks like (df):

    channelid  measurandkey  networkkey  sitekey   timestamp  userfloatvalue
 0       5069            Cu         HMT   01MEU1  2018-01-01             NaN
 1       5069            Cu         HMT   01MEU1  2018-01-02             NaN
 2       5069            Cu         HMT   01MEU1  2018-01-03             NaN
 3       5069            Cu         HMT   01MEU1  2018-01-04             NaN
 4       5119            Cu         HMT   01AND3  2018-01-01             NaN
 5       5119            Cu         HMT   01AND3  2018-01-02             NaN
 6       5119            Cu         HMT   01AND3  2018-01-03             NaN
 7       5119            Cu         HMT   01AND3  2018-01-04             NaN
 8       5120            Pb         HMT   01MEU1  2018-01-01             NaN
 9       5120            Pb         HMT   01MEU1  2018-01-02             NaN
10       5120            Pb         HMT   01MEU1  2018-01-03             NaN
11       5120            Pb         HMT   01MEU1  2018-01-04             NaN
12       5233            Pb         HMT   01AND3  2018-01-01             NaN
13       5233            Pb         HMT   01AND3  2018-01-02             NaN
14       5233            Pb         HMT   01AND3  2018-01-03             NaN
15       5233            Pb         HMT   01AND3  2018-01-04             NaN

Misbehaved output looks like (cross3):

channelid      5069                        5119                        5120                        5233
networkkey      HMT                         HMT                         HMT                         HMT
sitekey      01AND3        01MEU1        01AND3        01MEU1        01AND3        01MEU1        01AND3        01MEU1
measurandkey     Cu     Pb     Cu     Pb     Cu     Pb     Cu     Pb     Cu     Pb     Cu     Pb     Cu     Pb     Cu     Pb
timestamp
2018-01-01      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2018-01-02      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2018-01-03      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2018-01-04      NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN

The expected output is similar to cross2, shown here, but with NaN values instead of the string:

channelid      5069   5119   5120   5233
networkkey      HMT    HMT    HMT    HMT
sitekey      01MEU1 01AND3 01MEU1 01AND3
measurandkey     Cu     Cu     Pb     Pb
timestamp
2018-01-01   nodata nodata nodata nodata
2018-01-02   nodata nodata nodata nodata
2018-01-03   nodata nodata nodata nodata
2018-01-04   nodata nodata nodata nodata

Problem description

I need:

  • to use a MultiIndex in the columns even if it is overdetermined (that is, the index would still be unique with fewer levels); and
  • to keep columns that are entirely NaN, because they indicate that the channel lacks all its data.

The problem seems to be that all combinations of level modalities are created (instead of keeping only the existing ones), which drastically and needlessly increases memory usage, since those combinations are not present in the original data.
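
As a rough illustration of that growth (a sketch added for clarity, not part of the original report), the number of columns in the full Cartesian product of level values can be compared with the number of combinations actually present. The `channels` frame below only restates the four channels of the example above; the variable names are illustrative.

```python
import pandas as pd

column_keys = ["channelid", "networkkey", "sitekey", "measurandkey"]

# The four channels that actually exist in the example data.
channels = pd.DataFrame(
    [
        (5069, "HMT", "01MEU1", "Cu"),
        (5119, "HMT", "01AND3", "Cu"),
        (5120, "HMT", "01MEU1", "Pb"),
        (5233, "HMT", "01AND3", "Pb"),
    ],
    columns=column_keys,
)

# Combinations present in the data: one column per existing channel.
observed = len(channels.drop_duplicates())

# Size of the Cartesian product of the distinct values of each level.
cartesian = int(channels.nunique().prod())

print(observed, cartesian)
```

Here the product is 4 x 1 x 2 x 2 = 16, which matches the 16 columns of the cross3 output, against 4 observed combinations. With more channels and sites the gap grows multiplicatively.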

Maybe it is a bug, maybe it is the designed behaviour. I just wanted to report it because it surprised me, and I am now looking for a clean way to circumvent this behaviour.

How have I found it:

I first hit a MemoryError with small queries (about 1000 rows and 25 channels). I then reduced the number of rows and columns, and finally dumped the data to JSON to get the MVWE above.

Expected Output

To my understanding, the following command:

cross3 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", dropna=False)

Should return the same as:

cross2 = df.pivot_table(index="timestamp", columns=["channelid", "networkkey", "sitekey", "measurandkey"], values="userfloatvalue", aggfunc="first", fill_value="nodata")
cross2bis = cross2.replace('nodata', float('nan'))

That is, a small DataFrame with no extra columns and with the NaN values kept. It should look like:

channelid      5069   5119   5120   5233
networkkey      HMT    HMT    HMT    HMT
sitekey      01MEU1 01AND3 01MEU1 01AND3
measurandkey     Cu     Cu     Pb     Pb
timestamp
2018-01-01      NaN    NaN    NaN    NaN
2018-01-02      NaN    NaN    NaN    NaN
2018-01-03      NaN    NaN    NaN    NaN
2018-01-04      NaN    NaN    NaN    NaN

This should happen without generating combinations of level modalities that do not exist in the input data. It would also prevent a MemoryError from being raised for a reasonable amount of data.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 6 years ago
  • Reactions: 4
  • Comments: 13 (4 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Jan 24, 2019

The docs at /stable are a bit out of date. Try http://pandas-docs.github.io/pandas-docs-travis/development/contributing.html

On Thu, Jan 24, 2019 at 8:22 AM Sarnath notifications@github.com wrote:

https://pandas.pydata.org/pandas-docs/stable/contributing.html I will go with this. Let me know if something else is expected.


1 reaction
jorisvandenbossche commented, Oct 30, 2017

@jlandercy thanks for diving in! My first reflex was to say that this is due to limitations of unstack dealing with existing NaNs vs introduced NaNs due to unstacking (you can search for ‘unstack dropna’ in the issues to see some related discussions).

But actually, it seems you can get your desired result (I think) with the underlying groupby + unstack:

In [83]: df.groupby(["timestamp", "channelid", "networkkey", "sitekey", "measurandkey"]).agg('first').unstack([1,2,3,4])
Out[83]: 
             userfloatvalue                     
channelid              5069   5119   5120   5233
networkkey              HMT    HMT    HMT    HMT
sitekey              01MEU1 01AND3 01MEU1 01AND3
measurandkey             Cu     Cu     Pb     Pb
timestamp                                       
2018-01-01              NaN    NaN    NaN    NaN
2018-01-02              NaN    NaN    NaN    NaN
2018-01-03              NaN    NaN    NaN    NaN
2018-01-04              NaN    NaN    NaN    NaN

Is it correct that this is what you want?
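
The groupby + unstack suggestion above can be sketched as a self-contained script. This is a minimal illustration, assuming the same four channels and four daily timestamps as the example; the data is rebuilt in memory rather than loaded from the JSON dump, and selecting the single value column before aggregating is a choice made here so that the result has no extra "userfloatvalue" column level.

```python
import pandas as pd

keys = ["channelid", "networkkey", "sitekey", "measurandkey"]

# Assumed stand-in for the issue's data: 4 channels x 4 days, all values NaN.
channels = [
    (5069, "HMT", "01MEU1", "Cu"),
    (5119, "HMT", "01AND3", "Cu"),
    (5120, "HMT", "01MEU1", "Pb"),
    (5233, "HMT", "01AND3", "Pb"),
]
timestamps = pd.date_range("2018-01-01", periods=4, freq="D")
rows = [
    (ts, *channel, float("nan"))
    for channel in channels
    for ts in timestamps
]
df = pd.DataFrame(rows, columns=["timestamp", *keys, "userfloatvalue"])

# Group on the index key plus all column keys, take the first value per
# group, then move the column keys from the row index into the columns.
wide = (
    df.groupby(["timestamp", *keys])["userfloatvalue"]
    .first()
    .unstack(keys)
)

print(wide.shape)
```

Because the groups are built only from key combinations that occur in the data, the result has one column per existing channel, and the all-NaN columns are kept.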
