BUG: setting raw=True in the DataFrame.apply function causes a ValueError
See original GitHub issue.
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

def parse_node_row(row, index):
    def unpack_bitmask(val):
        return [np.bitwise_and(np.right_shift(val, i * 16), 0xFFFF) for i in range(4)]
    arr = np.concatenate(np.apply_along_axis(unpack_bitmask, 0, row)).ravel()
    return pd.Series(arr, index=index)

iterables = [df.columns, ["gpu{}".format(i) for i in range(4)]]
midx = pd.MultiIndex.from_product(iterables, names=["node", "gpu"])
ndf = df.apply(parse_node_row, axis="columns", result_type="expand", args=(midx,))
Problem description
I am working on a dataset where values are bitmasks: each 64-bit integer actually packs 4x16-bit values (one per GPU per node). The format of the dataset is out of my control, so I wrote the code above to create a new dataframe with a MultiIndex and the data split out.
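As a plain-Python illustration of the packing described above (the value below is made up, not from the dataset):

```python
# A 64-bit integer holds four 16-bit fields; field i occupies
# bits [16*i, 16*i + 16). Here we pack 1, 2, 3, 4 into one value.
val = (0x0004 << 48) | (0x0003 << 32) | (0x0002 << 16) | 0x0001

# Extract field i by shifting right 16*i bits and masking the low 16 bits.
fields = [(val >> (i * 16)) & 0xFFFF for i in range(4)]
# fields == [1, 2, 3, 4]
```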
It works fine and all, but since I am using only NumPy functions in parse_node_row, I added raw=True to the apply call. However, this causes my script to crash! The minimal reproducible example above raises
ValueError: Shape of passed values is (2, 8), indices imply (2, 2)
but runs just fine with raw=False. What gives?
There is no mention in the docs of any side effect of using raw=True besides your function receiving a NumPy array, which is completely fine for me.
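Not part of the original report, but as a hedged workaround sketch: the unpacking can be done entirely in vectorised NumPy, sidestepping apply (and therefore raw=True) altogether.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

# Broadcast the four 16-bit shifts across every cell at once.
vals = df.to_numpy(dtype=np.uint64)                    # shape (n_rows, n_cols)
shifts = np.arange(4, dtype=np.uint64) * np.uint64(16) # 0, 16, 32, 48
unpacked = (vals[:, :, None] >> shifts) & np.uint64(0xFFFF)  # (n_rows, n_cols, 4)

# Column-major layout matches MultiIndex.from_product(columns, gpus).
midx = pd.MultiIndex.from_product(
    [df.columns, ["gpu{}".format(i) for i in range(4)]], names=["node", "gpu"]
)
ndf = pd.DataFrame(unpacked.reshape(len(df), -1), index=df.index, columns=midx)
```

This keeps all the arithmetic in NumPy, which is where the performance of raw=True would have come from anyway.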
Expected Output
More performance!
Output of pd.show_versions()
INSTALLED VERSIONS
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 1.0.4
numpy            : 1.18.5
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 47.1.1
Cython           : 0.29.20
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.1
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.9.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.0
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.17.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 4
- Comments: 7 (4 by maintainers)
Top GitHub Comments
I’m not sure offhand.
I have a similar situation with df.apply using raw=True along with result_type="expand". It occurs with pandas>=1.1.0; pandas==1.0.5 works as expected.
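A hedged sketch (my own suggestion, not from the thread) of gating the raw flag on the version range reported above, so code keeps working across the regression boundary:

```python
import pandas as pd

# Per the comment above, the failure appears in pandas >= 1.1.0 while
# 1.0.x behaves as expected; only pass raw=True on the older series.
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
use_raw = (major, minor) < (1, 1)
```

The caller would then pass `raw=use_raw` to `df.apply(...)` instead of hard-coding `raw=True`.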