Performance compared to other simpler algorithm, and keygen improvement
I use the following function for natural sorting, and it worked fine for all my use cases until now. Now I needed to sort paths with subfolders, and your library does a very good job of sorting those correctly via ns.PATH, so thank you for that.
import re

def sorted_alphanumeric(data):
    # Digit runs compare as integers, everything else as lowercase text.
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(data, key=alphanum_key)
For a simple test case (non-path strings like ["a", "b", "c", "d"]), it’s 10x faster than your library. I tested this by simply running each sort 10k times.
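A timing run like the one described could be sketched as follows. This is only an illustration under assumptions: the helper name time_sort and the repeat count are hypothetical, and only the simple algorithm is timed so that the sketch runs without natsort installed.

```python
import re
import timeit


def sorted_alphanumeric(data):
    """The simple natural sort from the post."""
    def alphanum_key(text):
        return [int(c) if c.isdigit() else c.lower()
                for c in re.split(r"([0-9]+)", text)]
    return sorted(data, key=alphanum_key)


def time_sort(sort_fn, data, repeats=10_000):
    """Return the total seconds spent calling sort_fn(data) `repeats` times."""
    return timeit.timeit(lambda: sort_fn(data), number=repeats)


data = ["a", "b", "c", "d"]
simple_seconds = time_sort(sorted_alphanumeric, data)
print(f"simple algorithm: {simple_seconds:.4f} s for 10,000 sorts")

# To compare against natsort (if it is installed), time its natsorted
# function the same way:
#   from natsort import natsorted
#   print(time_sort(natsorted, data))
```

Absolute numbers depend on the machine and Python version, so only the ratio between the two timings is meaningful.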
Not sure if something can be implemented to sort out “simple” cases like those and use a faster method, but I’d like to propose another way where I could speed up your library in my test case a lot:
The main problem is that creating the ‘key’ takes time. So if we create the key once and then run the sort 10k times, the keygen has to run only twice.
In def natsorted(seq, key=None, reverse=False, alg=ns.DEFAULT) you allow key to be given to the function, yet giving it does absolutely nothing. You still run key = natsort_keygen(key, alg) every time, and the key will just be passed along.
So add if not key before key = natsort_keygen(key, alg). Hence, when given an already generated key, you won’t generate it again. By doing that I could speed up my test from being 13x slower to only being 2x slower.
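One way to get the effect of generating the key only once, without modifying natsorted itself, is to build the key with natsort_keygen and pass it to the built-in sorted. The following is a minimal sketch, assuming natsort is installed. The sample paths are made up, and the fallback key exists only so the sketch still runs without natsort; it is the simple key from the post, not natsort's algorithm.

```python
import re

try:
    # natsort_keygen is part of natsort's public API; it returns a key
    # function that can be handed to sorted() or list.sort().
    from natsort import natsort_keygen, ns
    path_key = natsort_keygen(alg=ns.PATH)
except ImportError:
    # Assumption: fallback so the sketch runs without natsort installed.
    def path_key(text):
        parts = re.split(r"([0-9]+)", text)
        return [int(p) if p.isdigit() else p.lower() for p in parts]

# Hypothetical sample data.
paths = ["folder/file10.txt", "folder/file2.txt", "folder/file1.txt"]

# The key function was built once above and is reused for every sort.
for _ in range(10_000):
    result = sorted(paths, key=path_key)

print(result)
```

The cost of constructing the key function is paid once, outside the loop, and each sort then only pays for applying it to the elements.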
Issue Analytics
- State:
- Created 4 years ago
- Comments: 7 (4 by maintainers)
I understand your surprise/frustration that natsort is not as fast as the “simple” algorithm you found on Stack Overflow. When I first started working on natsort, the algorithm was not much different from that, and then it grew over time. The complexity in the library is due to one of two things:
[...] ns.PATH, for example)
I strongly recommend you read the natsort documentation’s How It Works section, which goes into detail about how and why most (all?) of these options or safeguards were added.
Ultimately, my primary goal/focus for natsort is a library that will correctly sort a user’s input naturally (by whatever definition that user defines “naturally”). Having a performant algorithm is a secondary focus. That is not to say I do not want it to be fast - I have completely re-written the library twice over the project’s lifetime in order to improve the performance. I believe that the current infrastructure (using a functional/data pipeline approach) is as fast as I can realistically make the code without sacrificing maintainability.
But, because my primary goal is correctness, there are two steps that natsort performs that are in addition to what you show in the “simple” algorithm which will slow it down. Those steps are:
- [...] "é" coming after "z" instead of after "e".
- [...] natsort algorithm, each tuple of strings follows the pattern “string, number, string, number, etc.” (#7) [...] TypeErrors due to Python internally trying to compare a string to a number. [...] "" needs to be inserted before that number.
Both of these slow down the processing, especially the latter. It would be possible to omit them, but at the risk of either giving incorrectly sorted results or having the program crash.
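The two failure modes mentioned above can be reproduced with the standard library alone. This is a rough sketch, not natsort's actual implementation: the names nfd_key, naive_key and safe_key are hypothetical, the NFD normalization form is an assumption chosen for illustration, and naive_key deliberately uses a splitter that produces no leading empty string.

```python
import re
import unicodedata

# Step 1: Unicode normalization.
words = ["z", "\u00e9", "e"]  # "\u00e9" is a precomposed "é"


def nfd_key(text):
    # Decompose "é" into "e" plus a combining accent before comparing.
    return unicodedata.normalize("NFD", text)


print(sorted(words))               # code point order: "é" lands after "z"
print(sorted(words, key=nfd_key))  # "é" now sorts directly after "e"


# Step 2: keep every key in "string, number, string, number" order.
def naive_key(text):
    # Split into digit and non-digit runs; a leading number stays first.
    pairs = re.findall(r"([0-9]+)|([^0-9]+)", text)
    return tuple(int(digits) if digits else other for digits, other in pairs)


def safe_key(text):
    key = naive_key(text)
    # Insert "" before a leading number so position 0 is always a string.
    if key and isinstance(key[0], int):
        key = ("",) + key
    return key


mixed = ["a1", "1a"]
try:
    sorted(mixed, key=naive_key)
except TypeError as exc:
    print("naive key failed:", exc)

print(sorted(mixed, key=safe_key))
```

With naive_key, Python ends up comparing "a" with 1 and raises a TypeError; with safe_key, every position holds the same type across keys, so the comparison succeeds.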
I could imagine adding a mode called ns.UNSAFE or something that disabled these. But, after maintaining this library for so many years, I have found that unexpected data happens all the time, and encouraging users to embrace the safety at the expense of a bit of speed is probably a good thing.
I understand. Thank you for your elaborate answer.