PERF: Significant speed difference between `arr.mean()` and `arr.values.mean()` for common `dtype` columns

See original GitHub issue
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


I’m seeing a significant variance in timings for common math operations (e.g. mean, std, max) on a large Pandas Series vs the underlying NumPy array. A code example is shown below with 1 million elements and a roughly 10x speed difference; the screenshot below uses 10 million elements.

I’ve generated a testing module (https://github.com/ianozsvald/dtype_pandas_numpy_speed_test) which several people have tried on Intel & AMD hardware: https://github.com/ianozsvald/dtype_pandas_numpy_speed_test/issues/1

This module confirms the general trend that all of these operations are faster on the underlying NumPy array (not surprising, since it avoids the dispatch machinery), but for float operations the speed hit when using Pandas seems extreme:

[timings graph]

Code Sample, a copy-pastable example

A Python module exists in this repo, along with reports from several other users including screenshots of their graphs; the same general behaviour is seen across different machines: https://github.com/ianozsvald/dtype_pandas_numpy_speed_test

# note this is copied from my README linked above.
# paste into IPython or a Notebook
import pandas as pd
import numpy as np
arr = pd.Series(np.ones(shape=1_000_000))
arr.values.dtype                                                                                                                                                         
Out[]: dtype('float64')

arr.values.mean() == arr.mean()                                                                                                                                           
Out[]: True

# call arr.mean() vs arr.values.mean(), note circa 10x speed difference
# with roughly 4.6 ms vs 0.5 ms
%timeit arr.mean()
4.59 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr.values.mean()
485 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# note that the arr.values dereference is very cheap (nanoseconds)
%timeit arr.values 
456 ns ± 0.828 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Problem description

Is this slow-down expected? The slowdown feels extreme, but perhaps my testing methodology is flawed? I expected the float and integer math to operate at approximately the same speed, but instead we see a significant slow-down for Pandas float operations vs their NumPy counterparts.
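
Here is a minimal sketch of the per-dtype comparison I have in mind, written as a standalone script with the standard-library timeit module rather than the linked test module; the element count and loop counts are arbitrary choices for illustration:

import timeit

import numpy as np
import pandas as pd

n = 1_000_000
columns = {
    "float64": pd.Series(np.ones(n, dtype="float64")),
    "int64": pd.Series(np.ones(n, dtype="int64")),
}

for dtype, s in columns.items():
    arr = s.values  # the underlying NumPy array
    # time the bound methods directly; divide by `number` to get per-call time
    t_series = min(timeit.repeat(s.mean, number=100, repeat=3)) / 100
    t_array = min(timeit.repeat(arr.mean, number=100, repeat=3)) / 100
    print(f"{dtype}: Series.mean {t_series * 1e3:.2f} ms, "
          f"ndarray.mean {t_array * 1e3:.2f} ms, "
          f"ratio {t_series / t_array:.1f}x")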

I’ve also added some extra graphs in the repository linked above.

Expected Output

Output of pd.show_versions()

In [2]: pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.6.7-050607-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1.post20200529
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
jorisvandenbossche commented, Jun 15, 2020

We’re not actually using np.nanmean but our own implementation, which is (I suppose) doing something very similar to numpy’s.

So the main reason pandas is slower compared to numpy is that we skip missing values by default, which numpy doesn’t do.
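
To get a rough feel for that cost, compare NumPy’s plain and nan-aware means on the same data (pandas has its own nan-skipping reduction in its nanops layer rather than calling np.nanmean, so this is only an analogy):

import numpy as np
arr = np.ones(1_000_000)

# single-pass reduction, no missing-value handling
%timeit arr.mean()

# nan-aware reduction: also has to detect and skip NaNs
%timeit np.nanmean(arr)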

BTW, there is a “nullable float” dtype coming (https://github.com/pandas-dev/pandas/pull/34307), similar to the nullable integer dtype, where pd.NA is used instead of NaN as the missing value indicator (using a mask under the hood), and that is actually faster than the “nanfunc” approach:

In [1]: arr = pd.Series(np.ones(shape=1_000_000))                                                                                                                                                                  

In [2]: arr2 = arr.astype("Float64")  

In [3]: %timeit arr.sum()  
1.93 ms ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit arr2.sum() 
978 µs ± 117 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(showing “sum” instead of “mean”, because for mean we don’t yet have the faster “masked” implementation, https://github.com/pandas-dev/pandas/issues/34754)
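
To make the “mask under the hood” idea concrete, here is a conceptual sketch in plain NumPy (not the actual pandas masked-array internals): the nullable dtypes keep the data and a boolean validity mask side by side, so a reduction can skip missing entries without scanning the data for NaN:

import numpy as np

values = np.ones(1_000_000)
mask = np.zeros(1_000_000, dtype=bool)   # True marks a missing entry
mask[::1000] = True

# "nanfunc" style: missing values live inside the data as NaN and must be found
nan_values = values.copy()
nan_values[mask] = np.nan
%timeit np.nansum(nan_values)

# masked style: the positions to skip are already known
valid = ~mask
%timeit np.sum(values, where=valid)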

1 reaction
jreback commented, Jun 14, 2020

On your methodology, be sure to time both with and without bottleneck:

In [18]: import pandas as pd 
    ...: import numpy as np 
    ...: s = pd.Series(np.ones(shape=1_000_000))                                                                                                                                                                    

In [19]: pd.options.compute.use_bottleneck=False                                                                                                                                                                    

In [20]: %timeit s.mean()                                                                                                                                                                                           
2.83 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: pd.options.compute.use_bottleneck=True                                                                                                                                                                     

In [22]: %timeit s.mean()                                                                                                                                                                                           
1.21 ms ± 9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [23]: %timeit s.to_numpy().mean()                                                                                                                                                                                
365 µs ± 5.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: %prun s.mean()                                                                                                                                                                                             
         99 function calls in 0.002 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 {built-in method bottleneck.reduce.nanmean}
        1    0.000    0.000    0.002    0.002 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 nanops.py:155(_has_infs)
        4    0.000    0.000    0.000    0.000 _ufunc_config.py:39(seterr)
        1    0.000    0.000    0.002    0.002 series.py:4148(_reduce)
        1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.001    0.001 nanops.py:61(_f)
        1    0.000    0.000    0.001    0.001 nanops.py:97(f)

I think it should be clear that pandas mean is doing a lot more work than numpy by:

  • checking & dispatching on appropriate dtypes (e.g. we take means of datetimes)
  • checking for infinity (the slowdown here)

I suppose we don’t care about inf checking in this case. I think this was here historically because we may (depending on some options) treat these as NaNs and exclude them (a rough timing of that check is sketched below).

happy to take a PR here to remove that checking.
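
As a rough illustration of what that inf check adds (an illustrative NumPy stand-in, not the exact code path in pandas’ nanops), time a full-array infinity scan next to the reduction itself:

import numpy as np

arr = np.ones(1_000_000)

# the reduction itself
%timeit arr.mean()

# a full scan for infinities: roughly another pass over the data
%timeit np.isinf(arr).any()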
