Add key to sorting functions

Many Python functions (sorting, max/min) accept a key argument. Perhaps their pandas counterparts could too.
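For reference, here is a minimal sketch of how the key argument behaves in the standard-library functions mentioned above. The sample list `words` is made up for illustration.

```python
# Sample data, invented for this illustration.
words = ['pear', 'Apple', 'fig', 'banana']

# key is called once per element; the results are compared instead of the elements.
by_case_insensitive = sorted(words, key=str.lower)

# max and min accept the same kind of key function.
longest = max(words, key=len)
shortest = min(words, key=len)

print(by_case_insensitive)  # ['Apple', 'banana', 'fig', 'pear']
print(longest, shortest)    # banana fig
```

The original elements are returned unchanged; the key only decides the ordering.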

The terrible motivating example was this awful hack from this question… for which maybe one could do

df.sort_index(key=lambda t: literal_eval(t[1:-1]))

This would still be an awful awful hack, but a slightly less awful one.

Issue Analytics

Created 10 years ago
Reactions: 11
Comments: 20 (11 by maintainers)

Top GitHub Comments

9 reactions

SethMMorton commented, Apr 12, 2015

As long as it is well documented how to use the key on multiple columns, I don’t much care. Just having the key option would be a huge step in the right direction.

6 reactions

SethMMorton commented, Apr 12, 2015

Let me add a more concrete example of when having a key option to sort would be easier to use from a user’s perspective, and may possibly be more efficient than Categoricals.

Suppose that a user has data in text files, and one of the columns contains distances with associated units, e.g. “45.232m” or “0.59472km”. Let’s say there are ~500,000 rows, and each has a different distance. Now, suppose the user wants to sort the data based on this column. Obviously, they will have to transform the data to make it sortable, since a plain sort of the strings will not work. As far as I can tell, the two most obvious options currently are to a) make a new column holding the transformed values and sort on that column, or b) convert the column to a categorical and set its categories to the values sorted by the transformation.

import re
from pandas import read_csv

def transform_distance(x):
    """Convert string of value and unit to a float.
    Since this is just an example, skip error checking."""
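    # Match the number and the unit separately; a greedy (.*) would swallow
    # the unit prefix (e.g. the 'k' of 'km') into the number.
    m = re.match(r'([\d.]+)([mkc]?m)$', x)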
    units = {'m': 1, 'cm': 0.01, 'mm': 0.001, 'km': 1000}
    return float(m.group(1)) * units[m.group(2)]

df = read_csv('myfile.data')

# Sort method 1: Make a new column and sort on that.
# (sort_values replaces the older DataFrame.sort.)
df['distances_sort'] = df.distances.map(transform_distance)
df = df.sort_values('distances_sort')

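# Sort method 2: Use categoricals
# (sort_values replaces the older DataFrame.sort.)
df.distances = df.distances.astype('category')
# Order the unique categories by the transformed value.
df.distances = df.distances.cat.reorder_categories(sorted(df.distances.cat.categories, key=transform_distance), ordered=True)
df = df.sort_values('distances')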

To me, neither seems preferable: method 1 adds extra data to the DataFrame, which takes up space and which I have to filter out later if I want to write to file, while method 2 requires sorting all the data in my column before I can sort the DataFrame itself, which, unless I am mistaken, is not very efficient.

Things would be made worse if I then wanted to read in a second file and append that data to the DataFrame I already had, or if I wanted to modify the existing data in the “distances” column. I would then need to update my “distances_sort” column, or repeat the reorder_categories call, before I could sort again.

If a key argument were added to sort, all the boilerplate would go away, along with the extra processing. Sorting would just become

# Proposed sort method: Use a key argument
df.sort('distances', key=transform_distance)

Now, no matter how I update or modify my distances column, I do not need to do any additional pre-processing before sorting.
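For comparison, pandas 1.1 and later do accept a key argument on sort_values (and sort_index), though it differs from the call proposed above: the function receives the whole column as a Series and must return a same-length Series or array, so a per-element function has to be wrapped with map. The sketch below assumes such a pandas version; the sample frame and its `label` column are invented for illustration.

```python
import re

import pandas as pd

UNITS = {'m': 1, 'cm': 0.01, 'mm': 0.001, 'km': 1000}


def transform_distance(x):
    """Convert a string such as '0.59472km' to metres (no error checking)."""
    m = re.match(r'([\d.]+)([mkc]?m)$', x)
    return float(m.group(1)) * UNITS[m.group(2)]


# Invented sample data standing in for the file contents.
df = pd.DataFrame({
    'distances': ['45.232m', '0.59472km', '12cm', '7mm'],
    'label': ['a', 'b', 'c', 'd'],
})

# key gets the column as a Series, so apply the scalar function with map.
result = df.sort_values('distances', key=lambda col: col.map(transform_distance))

print(result)
```

No helper column is added and the original frame is left untouched, which is the behaviour the comment above asks for.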

The key argument could be flexible and support either a function, or a dict of functions. This second input type would be used if you wanted to provide a key for only a few columns, or different keys for different columns; for example:

# Supporting multi-column sorting with a key.
# In this case, only columns 'a' and 'c' would use the key for sorting,
# and 'b' and 'd' would sort in the standard way.
df.sort(['a', 'b', 'c', 'd'], key={'a': lambda x: x.lower(), 'c': transform_distance})
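The dict-of-functions idea can be approximated today with a small helper. This is only a sketch under assumptions: `sort_with_keys` and `to_metres` are hypothetical names, not part of pandas, and the sample frame is invented. It transforms just the columns named in the dict, sorts that temporary copy, and reorders the original rows to match.

```python
import re

import pandas as pd

UNITS = {'km': 1000.0, 'm': 1.0, 'cm': 0.01, 'mm': 0.001}


def to_metres(text):
    """Convert a string such as '20cm' to metres (no error checking)."""
    m = re.match(r'([\d.]+)([mkc]?m)$', text)
    return float(m.group(1)) * UNITS[m.group(2)]


def sort_with_keys(df, by, keys):
    """Hypothetical helper: sort df by the columns in `by`, applying
    keys[col] element-wise only to the columns that appear in `keys`."""
    transformed = pd.DataFrame({
        col: (df[col].map(keys[col]) if col in keys else df[col]).to_numpy()
        for col in by
    })
    # transformed has a default 0..n-1 index, so the sorted index is a
    # list of row positions in the desired order.
    order = transformed.sort_values(list(by)).index
    return df.iloc[order]


# Invented sample data.
df = pd.DataFrame({
    'a': ['b', 'A', 'a', 'B'],
    'b': [1, 2, 2, 1],
    'c': ['1km', '5m', '20cm', '3m'],
})

result = sort_with_keys(df, ['a', 'b', 'c'], {'a': str.lower, 'c': to_metres})
print(result)
```

Columns 'a' and 'c' are compared through their key functions while 'b' sorts in the standard way, and the returned frame still holds the original, untransformed values.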