BUG: setting raw=True in the DataFrame.apply function causes a ValueError

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

def parse_node_row(row, index):
    def unpack_bitmask(val):
        return [np.bitwise_and(np.right_shift(val, i * 16), 0xFFFF) for i in range(4)]

    arr = np.concatenate(np.apply_along_axis(unpack_bitmask, 0, row)).ravel()

    return pd.Series(arr, index=index)

iterables = [df.columns, ["gpu{}".format(i) for i in range(4)]]
midx = pd.MultiIndex.from_product(iterables, names=['node', 'gpu'])

ndf = df.apply(parse_node_row, axis="columns", result_type="expand", args=(midx,))

Problem description

I am working on a dataset where the values are bitmasks: each 64-bit integer actually packs four 16-bit values (one per GPU per node). The format of the dataset is out of my control, so I wrote the code above to build a new dataframe with a multi-index that holds the split data.
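To make the packing concrete, here is a minimal sketch of splitting one packed integer into its four 16-bit fields. It assumes field 0 sits in the lowest 16 bits, which matches the `i * 16` shift used in the code sample; the helper name `unpack_fields` and the packed value are hypothetical and chosen only for illustration.

```python
def unpack_fields(value, n_fields=4, bits=16):
    """Split an integer into n_fields fields of `bits` bits, lowest bits first."""
    mask = (1 << bits) - 1  # 0xFFFF for 16-bit fields
    return [(int(value) >> (i * bits)) & mask for i in range(n_fields)]


# Hypothetical packed value with gpu0=1, gpu1=2, gpu2=3, gpu3=4.
packed = 1 | (2 << 16) | (3 << 32) | (4 << 48)
fields = unpack_fields(packed)
print(fields)
```

The sample values in the dataframe above (100 to 400) all fit in the lowest 16 bits, so for them only the first field is non-zero.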

It works fine, but since I am solely using numpy functions in parse_node_row, I added raw=True to the apply call. However, this causes my script to crash! With raw=True, the minimal reproducible example above fails with

ValueError: Shape of passed values is (2, 8), indices imply (2, 2)

But it runs just fine with raw=False. What gives? The docs mention no side effects of using raw=True other than the applied function receiving a numpy array, which is completely fine for me.
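For reference, the documented difference between the two modes can be checked with a small sketch like the one below. It only records the type of object handed to the applied function and returns a scalar, so it does not touch the `result_type="expand"` path that raises the error. The `make_recorder` helper is hypothetical, and a closure is used instead of `args=` purely to keep the example simple.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

seen = {}  # maps a label to the type of object the applied function received


def make_recorder(key):
    """Build a row function that records its input type under `key`."""
    def record(row):
        seen[key] = type(row)
        return row.sum()  # scalar result, so no expansion is involved
    return record


sums_series = df.apply(make_recorder("raw_false"), axis="columns")
sums_raw = df.apply(make_recorder("raw_true"), axis="columns", raw=True)

print(seen)
```

With the default raw=False the function should see a pandas Series per row, and with raw=True a plain numpy array; the row sums themselves are the same in both cases.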

Expected Output

More performance!
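If speed is the goal, one possible workaround is to skip apply entirely and unpack the whole table with numpy broadcasting. This is only a sketch under assumptions, not a fix for the reported bug: it assumes every value fits in a signed 64-bit integer, and it lays the columns out node by node so that they line up with `MultiIndex.from_product`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [100, 300], "b": [200, 400]})

n_fields, bits = 4, 16
shifts = np.arange(n_fields) * bits       # [0, 16, 32, 48]
values = df.to_numpy(dtype=np.int64)      # shape (rows, nodes)

# Broadcast (rows, nodes, 1) against (n_fields,) -> (rows, nodes, n_fields),
# then keep the low 16 bits of every shifted value.
unpacked = (values[:, :, None] >> shifts) & 0xFFFF

columns = pd.MultiIndex.from_product(
    [df.columns, ["gpu{}".format(i) for i in range(n_fields)]],
    names=["node", "gpu"],
)

# Flatten the last two axes: all fields of node "a" first, then node "b".
ndf = pd.DataFrame(unpacked.reshape(len(df), -1), index=df.index, columns=columns)
print(ndf)
```

Because no Python function is called per row, this avoids both the apply overhead and the raw=True code path.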

Output of pd.show_versions()

INSTALLED VERSIONS

commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 1.0.4
numpy            : 1.18.5
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 47.1.1
Cython           : 0.29.20
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.1
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.9.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.1
matplotlib       : 3.2.0
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.17.1
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Feb 5, 2021

I’m not sure offhand.

1 reaction
snake575 commented, Sep 18, 2020

I have a similar situation with df.apply using raw=True along with result_type="expand". It occurs with pandas>=1.1.0; pandas==1.0.5 works as expected.

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.apply(lambda row: (1, 2, 3), axis=1, raw=True, result_type="expand")

pandas 1.0.5

0    (1, 2, 3)
1    (1, 2, 3)
dtype: object

pandas >= 1.1.0

ValueError: Shape of passed values is (2, 3), indices imply (2, 2)

