result_type behaviour in apply function is different from Pandas

System information

OS Platform and Distribution: Linux, Ubuntu 20.04
Modin version: Latest development master branch.
Python version: 3.8.10
Code we can use to reproduce:

import pandas
import modin.pandas as pd
import numpy as np
import ray
ray.init()

data = np.random.randint(0, 5, size=(5, 10))
df_modin = pd.DataFrame(data)
df_pandas = pandas.DataFrame(data)

df_new = df_modin.apply(np.square, result_type="reduce")
df_new2 = df_pandas.apply(np.square, result_type="reduce")

print(df_new)
print(type(df_new))
print(df_new2)
print(type(df_new2))

Describe the problem

For a function that returns a dataframe, e.g. np.square, the result_type="reduce" argument has no effect in pandas. In Modin, however, the entire resulting dataframe is returned as a Series.
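The pandas half of this report can be checked without Modin or Ray. The sketch below is only an illustration: it assumes a small fixed 2x2 input instead of the random data above, so the run is reproducible.

```python
import numpy as np
import pandas

# Fixed input (an assumption for reproducibility; the report uses random data).
df_pandas = pandas.DataFrame(np.array([[1, 2], [3, 4]]))

with_reduce = df_pandas.apply(np.square, result_type="reduce")
without_reduce = df_pandas.apply(np.square)

# As reported, pandas returns a DataFrame for np.square either way.
print(type(with_reduce))
print(with_reduce.equals(without_reduce))
```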

mvashishtha commented, Oct 20, 2022

Let’s put numpy universal functions like np.sqrt aside because they seem to have special behavior with result_type (not sure whether it’s a bug): https://github.com/pandas-dev/pandas/issues/49190

For the code below, the dataframe is:

import pandas as pd

df = pd.DataFrame([[1,2, 3, 4],[5, 6, 7, 8]])

Here’s my understanding so far:

Pandas behavior

When axis = 0

When result_type=None

if returning scalar for all columns:
  # df.apply(lambda col: 9)
  return series of scalars
if returning series for any column:
  if any other column returns a list-like (not a series) with a different length than the dataframe we'd get by concatenating all the series results along axis=1:
    # df.apply(lambda col: pd.Series(9) if col[0] != 1 else [10, 11])
    raise ValueError
  else:
    # df.apply(lambda col: pd.Series([9, 10]) if col[0] == 1 else pd.Series([11, 12], index=['a', 'b']) if col[0] == 2 else [13, 14, 15, 16] if col[0] == 3 else 17)
    return dataframe created by concatenating each returned series along axis 1, treating each list result as series with the same index as the final dataframe, and repeating any scalars by the length of final dataframe
if returning list-like for any column:
  if any other column returns a list-like (not a series) with a different length:
    # df.apply(lambda col: [9] if col[0] ==1 else [10, 11])
    return series of resulting lists 
  if all other columns return list-likes of same length:
    # df.apply(lambda col: [9] if col[0] == 1 else [10])
    return dataframe created by concatenating lists along axis=1
  # df.apply(lambda col: [9, 10] if col[0] == 1 else 11)
  # in this case we have some list-likes of the same length, and some scalars
  return dataframe made by repeating scalars up to the common length and concatenating the lists
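Three of the branches above can be run directly against pandas. This is a minimal sketch using the same example dataframe; the variable names are illustrative only, and it only prints the result type and shape of each case.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

# Scalar for every column: a series of scalars.
scalars = df.apply(lambda col: 9)

# Lists of different lengths: a series whose elements are the lists.
ragged = df.apply(lambda col: [9] if col[0] == 1 else [10, 11])

# Lists of the same length: a dataframe with one column per input column.
aligned = df.apply(lambda col: [9] if col[0] == 1 else [10])

for name, result in [("scalars", scalars), ("ragged", ragged), ("aligned", aligned)]:
    print(name, type(result).__name__, result.shape)
```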

When result_type=reduce

Every kind of function I can think of returns a series (including every one above). Note I have already excluded the numpy universal functions, which seem to be the only exception.
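As a small illustration of the difference, the sketch below applies a function returning same-length lists with and without result_type="reduce". It covers only this one case and does not touch the numpy ufunc exception.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

# Without result_type, same-length lists are expanded into a dataframe.
expanded = df.apply(lambda col: [9, 10])

# With result_type="reduce", each column's list is kept as one series element.
reduced = df.apply(lambda col: [9, 10], result_type="reduce")

print(type(expanded).__name__, expanded.shape)
print(type(reduced).__name__, reduced.shape)
```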

When result_type=expand

Every kind of function I can think of behaves the exact same way as when result_type is None. I filed https://github.com/pandas-dev/pandas/issues/49196 for this.

When result_type=broadcast

The result is always a dataframe, as the documentation says.
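A quick way to see this, sketched with a function that returns a scalar for every column:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

# The scalar is broadcast to the original shape of the dataframe.
broadcast = df.apply(lambda col: 9, result_type="broadcast")

print(type(broadcast).__name__, broadcast.shape)
```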

When axis = 1

result_type='reduce' has no effect (see https://github.com/pandas-dev/pandas/issues/49188). result_type='broadcast' seems to behave the same as with axis=0 (see also https://github.com/pandas-dev/pandas/issues/49188). result_type='expand' does seem to have an effect.
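The effect of result_type='expand' with axis=1 can be seen in a short sketch, modeled on the example in the pandas documentation for DataFrame.apply:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

# By default, a list returned per row stays a single element of a series.
as_lists = df.apply(lambda row: [1, 2], axis=1)

# With result_type="expand", each list element becomes its own column.
expanded = df.apply(lambda row: [1, 2], axis=1, result_type="expand")

print(type(as_lists).__name__, as_lists.shape)
print(type(expanded).__name__, expanded.shape)
```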

Conclusion: what to do in Modin

We need to figure out what’s going on in pandas with axis=0 vs axis=1. It is not clear what the intent was or whether the current behavior is buggy. For this, follow https://github.com/pandas-dev/pandas/issues/49188.
At least for axis=0, the assumption that result_type=reduce means the result is a series seems good, apart from the numpy ufunc problem, which we should clarify in https://github.com/pandas-dev/pandas/issues/49190. If the ufunc behavior is intended, we should fix Modin to match it.
The assumption that result_type=broadcast means the result is a dataframe is still good.
Once we clarify all the expected behavior, we should add tests for all these cases. Note that some of the cases I went through at the beginning of this post will break if different partitions have different types of function results, similar to #4690. We should test such cases anyway and xfail them with a TODO linking to #4690.
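One possible starting point for such tests is a helper that reduces each apply() result to a comparable summary. The sketch below is an assumption, not existing Modin test code: apply_signature and the cases list are hypothetical names, and it only runs against pandas. The idea is that the same helper could be called with a Modin dataframe and the two summaries compared.

```python
import pandas as pd


def apply_signature(frame, func, **apply_kwargs):
    """Hypothetical helper: summarize an apply() result as (type name, shape)."""
    try:
        result = frame.apply(func, **apply_kwargs)
    except Exception as err:
        # Record the exception type so that failures can be compared too.
        return ("raises", type(err).__name__)
    return (type(result).__name__, tuple(result.shape))


pandas_df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

# Illustrative cases only; a real test would cover every branch listed above.
cases = [
    ("scalar", lambda col: 9),
    ("same-length lists", lambda col: [9, 10]),
]

for label, func in cases:
    for result_type in (None, "reduce", "broadcast"):
        print(label, result_type, apply_signature(pandas_df, func, result_type=result_type))
```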

mvashishtha commented, Oct 20, 2022

Also cc @dchigarev who wrote the most recent version of apply result type inference.