Add key to sorting functions

Many Python functions (sorting, max/min) accept a key argument; perhaps their pandas counterparts could too.
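For reference, this is how the key argument behaves in the Python built-ins mentioned above. It is only a minimal sketch; the sample strings are invented for illustration.

```python
# Sample strings invented for illustration.
words = ["banana", "Cherry", "apple"]

# Without a key, uppercase letters sort before lowercase ones.
plain = sorted(words)

# With a key, elements are ordered by the value the key function returns.
folded = sorted(words, key=str.lower)

# max and min accept the same argument.
longest = max(words, key=len)
shortest = min(words, key=len)
```

The original elements are returned unchanged; the key only affects how they are compared.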

The terrible motivating example was this awful hack from this question… for which maybe one could do

df.sort_index(key=lambda t: literal_eval(t[1:-1]))

This would still be an awful awful hack, but a slightly less awful one.

Issue Analytics

  • State: closed
  • Created 10 years ago
  • Reactions: 11
  • Comments: 20 (11 by maintainers)

Top GitHub Comments

9 reactions
SethMMorton commented, Apr 12, 2015

As long as it is well documented how to use the key on multiple columns, I don’t much care. Just having the key option would be a huge step in the right direction.

6 reactions
SethMMorton commented, Apr 12, 2015

Let me add a more concrete example of when having a key option to sort would be easier to use from a user’s perspective, and may possibly be more efficient than Categoricals.

Suppose that a user had data in text files, and one of the columns contains distances with associated units, e.g. “45.232m” or “0.59472km”. Let’s say there are ~500,000 rows, and each has a different distance. Now, suppose the user wanted to sort based on the data in this column. Obviously, they will have to do some sort of transformation of this data to make it sortable, since a plain string sort will not order the values by distance. As far as I can tell, currently the two most obvious solutions are to a) make a new column of the transformation result and use that column for sorting, or b) make the column a category, sort the column's values using the transformation, and use that sorted list as the category order.

import re
from pandas import read_csv

def transform_distance(x):
    """Convert string of value and unit to a float.
    Since this is just an example, skip error checking."""
    # Non-greedy number group plus end anchor, so the prefix in 'km',
    # 'cm' or 'mm' is captured with the unit rather than with the number.
    m = re.match(r'(.*?)([mkc]?m)$', x)
    units = {'m': 1, 'cm': 0.01, 'mm': 0.001, 'km': 1000}
    return float(m.group(1)) * units[m.group(2)]

df = read_csv('myfile.data')

# DataFrame.sort was the sorting method when this was written;
# later pandas releases use sort_values instead.

# Sort method 1: Make a new column and sort on that.
df['distances_sort'] = df.distances.map(transform_distance)
df.sort('distances_sort')

# Sort method 2: Use categoricals
df.distances = df.distances.astype('category')
df.distances = df.distances.cat.reorder_categories(sorted(df.distances, key=transform_distance), ordered=True)
df.sort('distances')

To me, neither seems entirely preferable: method 1 adds extra data to the DataFrame, which takes up space and which I have to filter out later if I want to write out to file, and method 2 requires sorting all the data in my column before I can sort the data in my DataFrame, which, unless I am mistaken, is not very efficient.

Things would be made worse if I then wanted to read in a second file and append that data to the DataFrame I already had, or if I wanted to modify the existing data in the “distances” column. I would then need to re-update my “distances_sort” column, or re-perform the reorder_categories call before I could sort again.

If a key argument were added to sort, all the boilerplate would go away, as well as the extra processing. Sorting would just become

# Proposed sort method: Use a key argument
df.sort('distances', key=transform_distance)

Now, no matter how I update or modify my distances column, I do not need to do any additional pre-processing before sorting.
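For comparison, pandas releases from 1.1 onward accept a key callable in sort_values. Unlike the per-element key proposed above, it receives the whole column as a Series and must return something of the same length, so a per-element function has to be applied through map. The sketch below uses invented sample data and its own copy of the transform function, written with a non-greedy number group so that unit prefixes are parsed correctly.

```python
import re

import pandas as pd

UNITS = {'m': 1, 'cm': 0.01, 'mm': 0.001, 'km': 1000}

def transform_distance(x):
    """Convert a string such as '0.59472km' to a float in metres."""
    m = re.match(r'(.*?)([mkc]?m)$', x)
    return float(m.group(1)) * UNITS[m.group(2)]

# Sample data invented for illustration.
df = pd.DataFrame({
    'distances': ['45.232m', '0.59472km', '12cm', '3mm'],
    'label': ['a', 'b', 'c', 'd'],
})

# The key receives the whole 'distances' Series, so apply the
# per-element transform with map. No helper column is added to df.
result = df.sort_values('distances', key=lambda col: col.map(transform_distance))
```

Because the key is evaluated at sort time, the sort stays correct after rows are appended or the column is modified.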

The key argument could be flexible and support either a function, or a dict of functions. This second input type would be used if you wanted to provide a key for only a few columns, or different keys for different columns; for example:

# Supporting multi-column sorting with a key.
# In this case, only columns 'a' and 'c' would use the key for sorting,
# and 'b' and 'd' would sort in the standard way.
df.sort(['a', 'b', 'c', 'd'], key={'a': lambda x: x.lower(), 'c': transform_distance})
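The dict-of-functions idea can be approximated with a small helper. This is only a sketch under stated assumptions: sort_with_keys is a hypothetical name, not a pandas function, and the sample data is invented. It builds a temporary frame of key values, sorts that, and reorders the original rows by position.

```python
import re

import pandas as pd

UNITS = {'m': 1, 'cm': 0.01, 'mm': 0.001, 'km': 1000}

def transform_distance(x):
    """Convert a string such as '0.59472km' to a float in metres."""
    m = re.match(r'(.*?)([mkc]?m)$', x)
    return float(m.group(1)) * UNITS[m.group(2)]

def sort_with_keys(df, by, keys):
    """Hypothetical helper: sort df by the columns in `by`, applying
    keys[col] element-wise to any column that has an entry in `keys`."""
    tmp = pd.DataFrame({
        col: (df[col].map(keys[col]) if col in keys else df[col]).to_numpy()
        for col in by
    })
    # tmp has a default 0..n-1 index, so its sorted index is a positional order.
    order = tmp.sort_values(by).index.to_numpy()
    return df.iloc[order]

# Sample data invented for illustration.
df = pd.DataFrame({
    'a': ['b', 'A', 'a', 'B'],
    'b': [1, 2, 2, 1],
    'c': ['1km', '5m', '2cm', '1m'],
})

# 'a' and 'c' use a key; 'b' sorts in the standard way.
result = sort_with_keys(df, ['a', 'b', 'c'],
                        {'a': lambda x: x.lower(), 'c': transform_distance})
```

The temporary frame is discarded after sorting, so the returned DataFrame has the same columns and values as the input.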