result_type behaviour in apply function is different from Pandas

System information

  • OS Platform and Distribution: Linux, Ubuntu 20.04
  • Modin version: Latest development master branch.
  • Python version: 3.8.10
  • Code we can use to reproduce:
import pandas
import modin.pandas as pd
import numpy as np
import ray
ray.init()

data = np.random.randint(0, 5, size=(5, 10))
df_modin = pd.DataFrame(data)
df_pandas = pandas.DataFrame(data)

df_new = df_modin.apply(np.square, result_type="reduce")
df_new2 = df_pandas.apply(np.square, result_type="reduce")

print(df_new)
print(type(df_new))
print(df_new2)
print(type(df_new2))

Describe the problem

Passing result_type="reduce" with a function that returns a dataframe, e.g., np.square, doesn’t have any effect in pandas. In Modin, however, the entire resulting dataframe is returned as a Series.
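
The pandas half of this report can be checked without Modin or Ray. The following is a minimal sketch, assuming a pandas version in which NumPy ufuncs still bypass result_type (the behavior this issue describes); the deterministic `data` array is an illustrative stand-in for the random one in the reproducer.

```python
import numpy as np
import pandas as pd

# Illustrative, deterministic stand-in for the random data in the reproducer.
data = np.arange(50).reshape(5, 10)
df = pd.DataFrame(data)

# np.square is a NumPy ufunc; per the report, pandas ignores
# result_type="reduce" here and hands back a DataFrame.
result = df.apply(np.square, result_type="reduce")

print(type(result))
print(result.shape)
```

If the printed type is a DataFrame, the local pandas build matches the behavior the report attributes to pandas.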

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
mvashishtha commented, Oct 20, 2022

Let’s put numpy universal functions like np.sqrt aside because they seem to have special behavior with result_type (not sure whether it’s a bug): https://github.com/pandas-dev/pandas/issues/49190

For the code below, the dataframe is:

import pandas as pd

df = pd.DataFrame([[1,2, 3, 4],[5, 6, 7, 8]])

Here’s my understanding so far:

Pandas behavior

When axis = 0

When result_type=None

if returning scalar for all columns:
  # df.apply(lambda col: 9)
  return series of scalars
if returning series for any column:
  if any other column returns a list-like (not a series) with a different length than the dataframe we'd get by concatenating all the series results along axis=1:
    # df.apply(lambda col: pd.Series(9) if col[0] != 1 else [10, 11])
    raise ValueError
  else:
    # df.apply(lambda col: pd.Series([9, 10]) if col[0] == 1 else pd.Series([11, 12], index=['a', 'b']) if col[0] == 2 else [13, 14, 15, 16] if col[0] == 3 else 17)
    return dataframe created by concatenating each returned series along axis 1, treating each list result as series with the same index as the final dataframe, and repeating any scalars by the length of final dataframe
if returning list-like for any column:
  if any other column returns a list-like (not a series) with a different length:
    # df.apply(lambda col: [9] if col[0] == 1 else [10, 11])
    return series of resulting lists
  if all other columns return list-likes of the same length:
    # df.apply(lambda col: [9] if col[0] == 1 else [10])
    return dataframe created by concatenating lists along axis=1
  # df.apply(lambda col: [9, 10] if col[0] == 1 else 11)
  # in this case we have some list-likes of the same length, and some scalars
  return dataframe made by repeating scalars up to the common length and concatenating the lists
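
A few of the branches above can be spot-checked directly in pandas. This is a minimal sketch covering only the scalar and list-like cases; it uses `col.iloc[0]` in place of `col[0]` purely as a stylistic assumption, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

# Scalar for every column: a Series of scalars, one entry per column.
scalars = df.apply(lambda col: 9)

# Equal-length lists for every column: a DataFrame built from the lists.
same_length = df.apply(lambda col: [9, 10])

# Lists of different lengths: a Series whose elements are the raw lists.
ragged = df.apply(lambda col: [9] if col.iloc[0] == 1 else [10, 11])

for name, result in [
    ("scalars", scalars),
    ("same_length", same_length),
    ("ragged", ragged),
]:
    print(name, type(result).__name__)
```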

When result_type=reduce

Every kind of function I can think of returns a series (including every one above). Note I have already excluded the numpy universal functions, which seem to be the only exception.
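
To illustrate the point with one of the list-returning functions, here is a small sketch comparing the default against result_type="reduce" for axis=0. The variable names are illustrative, and only a non-ufunc lambda is used, since ufuncs are excluded above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

# Default inference turns equal-length lists into a DataFrame.
default_result = df.apply(lambda col: [9, 10])

# With result_type="reduce", the lists are kept as elements of a Series.
reduced = df.apply(lambda col: [9, 10], result_type="reduce")

print(type(default_result).__name__)
print(type(reduced).__name__)
print(reduced.iloc[0])
```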

When result_type=expand

Every kind of function I can think of behaves the exact same way as when result_type is None. I filed https://github.com/pandas-dev/pandas/issues/49196 for this.

When result_type=broadcast

The result is always a dataframe, as the documentation says.
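
As a quick check of the broadcast case, the sketch below applies a scalar-returning function with axis=0. The variable name is illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

# The scalar returned for each column is broadcast back to the
# original shape, so the result keeps the input's rows and columns.
broadcast = df.apply(lambda col: 9, result_type="broadcast")

print(type(broadcast).__name__)
print(broadcast.shape)
```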

When axis = 1

result_type='reduce' has no effect (see https://github.com/pandas-dev/pandas/issues/49188). result_type='broadcast' seems to behave the same as for axis=0 (see also https://github.com/pandas-dev/pandas/issues/49188). result_type='expand' does seem to have an effect.
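
The axis=1 observations can be sketched as follows. The functions and variable names are illustrative assumptions; the 'reduce' result is printed without an expected value, since that behavior is what the linked pandas issue questions.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])


def summarize(row):
    # Illustrative helper: returns a plain list for each row.
    return [int(row.sum()), int(row.max())]


# Default: list results along axis=1 stay as elements of a Series.
rows_default = df.apply(summarize, axis=1)

# result_type="expand" spreads each list across columns of a DataFrame.
rows_expanded = df.apply(summarize, axis=1, result_type="expand")

# A Series-returning function with result_type="reduce"; inspect the
# type to see whether "reduce" has any effect in the installed pandas.
rows_reduced = df.apply(lambda row: row * 2, axis=1, result_type="reduce")

print(type(rows_default).__name__)
print(type(rows_expanded).__name__)
print(type(rows_reduced).__name__)
```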

Conclusion: what to do in Modin

  • We need to figure out what’s going on in pandas with axis=0 vs axis=1. It’s not clear what the intent was or whether the current behavior is buggy. For this, follow https://github.com/pandas-dev/pandas/issues/49188
  • At least for axis=0, the (result_type=reduce => result_type = series) assumption seems good apart from the numpy ufunc problem, which we should clarify in https://github.com/pandas-dev/pandas/issues/49190. If the ufunc behavior is intended, we should fix it.
  • The (result_type=broadcast => result_type = dataframe) assumption is still good.
  • Once we clarify all the expected behavior, we should add tests for all these cases. Note that some of the cases I went through at the beginning of this post will break if different partitions have different types of function results, similar to #4690. But we should test such cases anyway and xfail them with a todo linking to #4690.

0 reactions
mvashishtha commented, Oct 20, 2022

Also cc @dchigarev who wrote the most recent version of apply result type inference.
