
PERF: DataFrame.sort_values(by=[x,y], inplace=True) speed improvement?

See original GitHub issue

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

cnt = 10000000
np.random.seed(1234)
d_orig = pd.DataFrame({
    'timestamp': np.random.randint(0, np.iinfo(np.int64).max, cnt),
    'user_id': np.random.randint(0, np.iinfo(np.int64).max, cnt),
})


d = d_orig.copy()
%timeit d.sort_values(by=['user_id', 'timestamp'], inplace=True)

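def sort_one_by_one(d, col1, col2):
    """
    Equivalent to d.sort_values(by=[col1, col2], inplace=True), but faster.
    """
    # First pass: order the rows by the secondary key.
    d.sort_values(by=[col2], inplace=True)
    # Second pass: a stable sort (mergesort) on the primary key keeps the
    # col2 ordering intact within equal col1 values.
    d.sort_values(by=[col1], kind='mergesort', inplace=True)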


d = d_orig.copy()
%timeit sort_one_by_one(d, 'user_id', 'timestamp')
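
The `%timeit` lines above only run inside IPython. Note also that `inplace=True` leaves `d` sorted after the first run, so later repetitions may be timing already-sorted input. A minimal sketch of a plain-Python harness that gives each run a fresh unsorted copy is shown below; the helper names `sort_direct` and `best_time` are hypothetical, and `cnt` is reduced from the issue's 10000000 so the example finishes quickly.

```python
import timeit

import numpy as np
import pandas as pd


def sort_direct(df, col1, col2):
    # Single multi-column sort, as in the first %timeit line.
    df.sort_values(by=[col1, col2], inplace=True)


def sort_one_by_one(df, col1, col2):
    # Two passes: secondary key first, then a stable sort on the primary key.
    df.sort_values(by=[col2], inplace=True)
    df.sort_values(by=[col1], kind='mergesort', inplace=True)


def best_time(sort_fn, df_orig, col1, col2, repeats=3):
    """Best wall-clock time over `repeats` runs, each on a fresh unsorted copy."""
    times = []
    for _ in range(repeats):
        df = df_orig.copy()  # the copy happens outside the timed region
        start = timeit.default_timer()
        sort_fn(df, col1, col2)
        times.append(timeit.default_timer() - start)
    return min(times)


cnt = 100000  # assumption: far smaller than the 10000000 rows used in the issue
np.random.seed(1234)
d_orig = pd.DataFrame({
    'timestamp': np.random.randint(0, np.iinfo(np.int64).max, cnt, dtype=np.int64),
    'user_id': np.random.randint(0, np.iinfo(np.int64).max, cnt, dtype=np.int64),
})

print('direct:    ', best_time(sort_direct, d_orig, 'user_id', 'timestamp'))
print('one by one:', best_time(sort_one_by_one, d_orig, 'user_id', 'timestamp'))
```

The absolute numbers depend on the pandas version, the hardware, and the data, so they will not necessarily reproduce the ratio reported in this issue.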

Problem description

I have a timestamped dataset with user ids and other information. I need to process (with numba) sequentially the dataset and for this I need it sorted by user_id and then by timestamp for each user_id.

The first and most obvious approach is:

data.sort_values(by=['user_id', 'timestamp'], inplace=True)

I’m using inplace because the dataset is huge (it fits into RAM, occupying roughly a third of the machine’s memory), and I hope this keeps memory usage from ballooning during processing. The problem is that this direct approach is slow. I noticed that sorting first by one column and then by the other with a stable sort (mergesort) is much faster. Depending on the data, I have seen processing times up to 4x shorter; on the random data generated with seed 1234 it is 3x.
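
To illustrate why the two-pass idea gives the same ordering, here is a tiny sketch on made-up data (the values are illustrative, not from the real dataset). Sorting by the secondary key first and then applying a stable sort on the primary key leaves rows with equal `user_id` in their `timestamp` order.

```python
import pandas as pd

# Illustrative data: two users appear twice, with out-of-order timestamps.
df = pd.DataFrame({
    'user_id':   [2, 1, 2, 1, 3],
    'timestamp': [5, 9, 1, 3, 7],
})

# Direct multi-column sort.
direct = df.sort_values(by=['user_id', 'timestamp'])

# Two passes: secondary key first, then a stable sort on the primary key.
two_pass = (
    df.sort_values(by=['timestamp'])
      .sort_values(by=['user_id'], kind='mergesort')
)

# Compare the key columns only, ignoring the original index labels.
keys = ['user_id', 'timestamp']
same = direct[keys].reset_index(drop=True).equals(
    two_pass[keys].reset_index(drop=True)
)
print(same)
```

One caveat worth keeping in mind: the first pass uses the default sort kind, which is not guaranteed to be stable, so rows that tie on both keys may end up in a different relative order than with the direct sort. That only matters when other columns distinguish such rows.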

I think my solution works: I checked that the dataset is non-decreasing in user_id and non-decreasing in timestamp within each user_id.
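
The check described above could be written roughly as follows. This is a sketch, not the code actually used in the issue; the function name `is_sorted_by_two_keys` is hypothetical, and `to_numpy()` assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd


def is_sorted_by_two_keys(df, primary, secondary):
    """True if `primary` is non-decreasing and `secondary` is
    non-decreasing within each run of equal `primary` values."""
    p = df[primary].to_numpy()
    s = df[secondary].to_numpy()
    if len(p) < 2:
        return True
    primary_ok = p[1:] >= p[:-1]
    # Only adjacent rows with the same primary key constrain the secondary key.
    same_group = p[1:] == p[:-1]
    secondary_ok = s[1:] >= s[:-1]
    return bool(primary_ok.all() and secondary_ok[same_group].all())


# Illustrative usage on made-up data.
good = pd.DataFrame({'user_id': [1, 1, 2], 'timestamp': [3, 9, 1]})
bad = pd.DataFrame({'user_id': [1, 1, 2], 'timestamp': [9, 3, 1]})
print(is_sorted_by_two_keys(good, 'user_id', 'timestamp'))
print(is_sorted_by_two_keys(bad, 'user_id', 'timestamp'))
```

The comparison works on adjacent rows only, so it runs in a single vectorized pass without a `groupby`.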

Am I missing something? Will this method perform worse in some situations? On both small and big data (raise or lower the cnt variable) it behaves similarly. Would you consider it an enhancement and a performance speedup? (It is very easy to implement 😉)

Output

1 loop, best of 3: 14.9 s per loop
1 loop, best of 3: 4.96 s per loop

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.6.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

jreback commented, Jul 29, 2017 (1 reaction)

Note that inplace doesn’t matter here per se; it’s really a multi-column sort that would benefit from this.

jreback commented, Mar 25, 2018 (0 reactions)

@mficek ok fair enough. If you want to continue to make this work even in a limited scenario (which is detectable), pls ping / re-open.
