BUG: setting raw=True in the DataFrame.apply function causes a ValueError
See original GitHub issue.
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

def parse_node_row(row, index):
    def unpack_bitmask(val):
        return [np.bitwise_and(np.right_shift(val, i * 16), 0xFFFF) for i in range(4)]
    arr = np.concatenate(np.apply_along_axis(unpack_bitmask, 0, row)).ravel()
    return pd.Series(arr, index=index)

iterables = [df.columns, ["gpu{}".format(i) for i in range(4)]]
midx = pd.MultiIndex.from_product(iterables, names=["node", "gpu"])
ndf = df.apply(parse_node_row, axis="columns", result_type="expand", args=(midx,))
Problem description
I am working on a dataset where values are bitmasks: each 64-bit integer actually packs 4x16-bit values (one per GPU per node). The format of the dataset is out of my control, so I wrote the code above to create a new dataframe with a MultiIndex and the data split out.
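As a plain-Python illustration of the packing described above (the value below is made up, not from the dataset):

```python
# A 64-bit integer holds four 16-bit fields; field i occupies
# bits [16*i, 16*i + 16). Here we pack 1, 2, 3, 4 into one value.
val = (0x0004 << 48) | (0x0003 << 32) | (0x0002 << 16) | 0x0001

# Extract field i by shifting right 16*i bits and masking the low 16 bits.
fields = [(val >> (i * 16)) & 0xFFFF for i in range(4)]
# fields == [1, 2, 3, 4]
```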
It works fine and all, but since I am using only NumPy functions in parse_node_row, I added raw=True to the apply call. However, this causes my script to crash! The minimal reproducible example above raises
ValueError: Shape of passed values is (2, 8), indices imply (2, 2)
but runs just fine with raw=False. What gives?
There is no mention in the docs of any side effect of using raw=True besides your function receiving a NumPy array, which is completely fine for me.
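Not part of the original report, but as a hedged workaround sketch: the unpacking can be done entirely in vectorised NumPy, sidestepping apply (and therefore raw=True) altogether.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

# Broadcast the four 16-bit shifts across every cell at once.
vals = df.to_numpy(dtype=np.uint64)                    # shape (n_rows, n_cols)
shifts = np.arange(4, dtype=np.uint64) * np.uint64(16) # 0, 16, 32, 48
unpacked = (vals[:, :, None] >> shifts) & np.uint64(0xFFFF)  # (n_rows, n_cols, 4)

# Column-major layout matches MultiIndex.from_product(columns, gpus).
midx = pd.MultiIndex.from_product(
    [df.columns, ["gpu{}".format(i) for i in range(4)]], names=["node", "gpu"]
)
ndf = pd.DataFrame(unpacked.reshape(len(df), -1), index=df.index, columns=midx)
```

This keeps all the arithmetic in NumPy, which is where the performance of raw=True would have come from anyway.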
Expected Output
More performance!
Output of pd.show_versions()
INSTALLED VERSIONS
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 1.0.4
numpy            : 1.18.5
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 47.1.1
Cython           : 0.29.20
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.1
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.9.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.0
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.17.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 4
- Comments: 7 (4 by maintainers)
Top GitHub Comments
I’m not sure offhand.
I have a similar situation with df.apply using raw=True along with result_type="expand". It occurs with pandas>=1.1.0; pandas==1.0.5 works as expected.
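A hedged sketch (my own suggestion, not from the thread) of gating the raw flag on the version range reported above, so code keeps working across the regression boundary:

```python
import pandas as pd

# Per the comment above, the failure appears in pandas >= 1.1.0 while
# 1.0.x behaves as expected; only pass raw=True on the older series.
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
use_raw = (major, minor) < (1, 1)
```

The caller would then pass `raw=use_raw` to `df.apply(...)` instead of hard-coding `raw=True`.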