Performance compared to a simpler algorithm, and a keygen improvement

I use the following function for natural sorting, and it worked fine for all my use cases until now. I now need to sort paths with subfolders, and your library does a very good job of sorting those correctly via ns.PATH, so thank you for that.

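import re

def sorted_alphanumeric(data):
    # Digit runs compare as integers; everything else compares case-insensitively.
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    # Split on digit runs; the capture group keeps the digits in the result.
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(data, key=alphanum_key)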

For a simple test case (for non-path strings like ["a", "b", "c", "d"]), it’s 10x faster than your library. I tested this by simply running each sort 10k times.
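A comparison like the one described could be reproduced with a small harness along these lines. This is only a sketch: `time_sort` and the sample data are hypothetical names, the key is a restatement of the simple function above, and absolute timings will differ from machine to machine. Any other sort callable (for example `natsort.natsorted`, if it is installed) could be passed in the same way.

```python
import re
import timeit


def sorted_alphanumeric(data):
    """The simple natural sort from the question, restated with named helpers."""
    def convert(text):
        return int(text) if text.isdigit() else text.lower()

    def alphanum_key(key):
        return [convert(c) for c in re.split("([0-9]+)", key)]

    return sorted(data, key=alphanum_key)


def time_sort(sort_fn, data, repeats=10_000):
    """Return the total seconds taken to call sort_fn(data) `repeats` times."""
    return timeit.timeit(lambda: sort_fn(data), number=repeats)


data = ["a", "b", "c", "d"]  # the "simple" non-path test case
elapsed = time_sort(sorted_alphanumeric, data)
print(f"simple key: {elapsed:.4f} s for 10,000 sorts")
```

Running both candidates through the same function keeps the comparison fair, since each one pays the same call overhead.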

Not sure if something can be implemented to sort out “simple” cases like those and use a faster method, but I’d like to propose another way where I could speed up your library in my test case a lot:

The main problem is that creating the ‘key’ takes time. So if we create the key once and then use the sort 10k times, the keygen has to run only twice.

In def natsorted(seq, key=None, reverse=False, alg=ns.DEFAULT) you allow key to be given to the function, yet giving it does absolutely nothing. You still run key = natsort_keygen(key, alg) every time, and the key will be just passed along.

So add if not key before key = natsort_keygen(key, alg). Hence, when given an already generated key, you won’t generate it again. From doing that I could speed up my test from being 13x slower to only being 2x slower.
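The general idea of building a key function once and reusing it across many sorts can be sketched with the standard library alone. `make_natural_key` below is a hypothetical stand-in for a key generator, not natsort's implementation. As far as I know, natsort exposes `natsort_keygen` for this purpose, and its result can be passed to the built-in `sorted`.

```python
import re


def make_natural_key():
    """Build a natural-sort key function once (hypothetical stand-in for a keygen)."""
    digits = re.compile(r"([0-9]+)")  # compiled a single time, when the key is created

    def natural_key(text):
        # re.split with one capture group alternates text, digits, text, ...
        # so odd positions always hold ASCII digit runs.
        pieces = digits.split(text)
        return [int(p) if i % 2 else p.lower() for i, p in enumerate(pieces)]

    return natural_key


natural_key = make_natural_key()  # pay the setup cost once
batches = [["f10", "f2", "f1"], ["b3", "b20", "a7"]]
results = [sorted(batch, key=natural_key) for batch in batches]  # reuse the same key
print(results)
```

Because the key is an ordinary callable, it works with `sorted` and `list.sort`, so the sorting call itself never has to rebuild it.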

Top GitHub Comments

SethMMorton commented, Jan 28, 2020

I understand your surprise/frustration that natsort is not as fast as the “simple” algorithm you found on Stack Overflow. When I first started working on natsort, the algorithm was not much different from that, and then it grew over time. The complexity in the library is due to one of two things:

- providing the user the flexibility to choose different definitions of “natural sort” (like ns.PATH, for example)
- safeguards against “bad” data being given by the user

I strongly recommend you read the natsort documentation’s How It Works section, which goes into detail on how and why most (all?) of these options or safeguards were added.

Ultimately, my primary goal/focus for natsort is a library that will correctly sort a user’s input naturally (by whatever definition of “naturally” that user chooses). Having a performant algorithm is a secondary focus. That is not to say I do not want it to be fast - I have completely re-written the library twice over the project’s lifetime in order to improve the performance. I believe that the current infrastructure (using a functional/data pipeline approach) is as fast as I can realistically make the code without sacrificing maintainability.

But, because my primary goal is correctness, there are two steps that natsort performs that are in addition to what you show in the “simple” algorithm which will slow it down. Those steps are:

1. Normalizing unicode input (#44)
   - Basically, ensure that unicode characters that look the same but are encoded differently are all normalized to the same encoding convention. This prevents things like "é" coming after "z" instead of after "e".
2. Ensuring that, after going through the natsort algorithm, each tuple of strings follows the pattern “string, number, string, number, etc.” (#7)
   - This eliminates TypeErrors due to Python internally trying to compare a string to a number.
   - This is commonly needed when the first character in a string is a number - a "" needs to be inserted before that number.

Both of these slow down the processing, especially the latter. It would be possible to omit them, but at the risk of either giving incorrectly sorted results or having the program crash.
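The two safeguards described above can be sketched roughly as follows. This is an illustration under stated assumptions, not natsort's actual code: `naive_key` and `safe_key` are hypothetical names, the splitter is assumed to discard empty pieces (which is what exposes a leading number), and NFD is picked as the normalization form purely for the example.

```python
import re
import unicodedata

_DIGITS = re.compile(r"([0-9]+)")


def naive_key(text):
    # Hypothetical splitter that drops empty pieces, so "1a" becomes [1, "a"].
    return [
        int(p) if (p.isascii() and p.isdigit()) else p.lower()
        for p in _DIGITS.split(text)
        if p
    ]


def safe_key(text):
    # Safeguard 1: normalize so look-alike characters share one encoding.
    parts = naive_key(unicodedata.normalize("NFD", text))
    # Safeguard 2: make the key start with a string, so that ints are only
    # ever compared with ints and strings with strings.
    if parts and isinstance(parts[0], int):
        parts.insert(0, "")
    return parts


try:
    sorted(["a1", "1a"], key=naive_key)
    naive_failed = False
except TypeError:  # "a" and 1 cannot be ordered against each other
    naive_failed = True

print(naive_failed)
print(sorted(["a1", "1a"], key=safe_key))  # → ['1a', 'a1']
print(sorted(["z", "\u00e9", "e"], key=safe_key))  # "é" lands after "e", before "z"
```

Both steps run for every element on every sort, which is consistent with the point that they cost time even when the input happens to be well behaved.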

I could imagine adding a mode called ns.UNSAFE or something that disabled these. But, after maintaining this library for so many years, I have found that unexpected data happens all the time, and encouraging users to embrace the safety at the expense of a bit of speed is probably a good thing.

ganego commented, Jan 28, 2020

I understand. Thank you for your elaborate answer.
