PERF Consider using argpartition in ndcg_score
See original GitHub issue. As reported by @karlhigley:

"I now take issue with the implementation of NDCG in sklearn, which seems like it could use argpartition and be much faster for long lists of items with small top-K results (e.g. NDCG@100 with 60,000 items)."

Would you like to propose a PR to improve it?
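To make the quoted suggestion concrete, here is a minimal sketch (assumptions: plain NumPy; top_k_indices is a hypothetical helper, not scikit-learn code) of selecting the top-k items with np.argpartition instead of fully sorting the scores:

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, ordered by descending score."""
    # np.argpartition runs in O(n): afterwards, the indices of the k
    # largest scores occupy the last k slots, in arbitrary order.
    top_k = np.argpartition(scores, -k)[-k:]
    # Only those k entries need a real sort, costing O(k log k)
    # instead of the O(n log n) of an argsort over all n items.
    return top_k[np.argsort(scores[top_k])[::-1]]
```

For the case quoted above (NDCG@100 over 60,000 items), this replaces a full sort of 60k scores with a linear scan plus a sort of 100 entries.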
Issue Analytics
- Created 3 years ago
- Comments:10 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
For context, I am training models on the MovieLens 25M dataset and would like to compute a learning curve for NDCG on the validation set as training progresses. IIRC, that involves computing ~160k NDCGs over ~60k items each. Doing so takes much longer than training 10-20 epochs, which makes it cost prohibitive.
I don’t actually need to sort all 60k items to compute NDCG@100 though; I just need to identify the top 100 predicted scores and their indices, and then fetch the corresponding relevance labels/scores for those 100 items.
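Building on that observation, a hedged sketch of the full metric (assuming the common NDCG formulation with linear gains and log2 discounts; dcg_at_k and ndcg_at_k are hypothetical helpers, not the scikit-learn API):

```python
import numpy as np

def dcg_at_k(y_true, y_score, k):
    """DCG@k without sorting all items (linear gains, log2 discounts)."""
    k = min(k, y_score.shape[0])
    # O(n) selection of the indices of the k highest predicted scores...
    top_k = np.argpartition(y_score, -k)[-k:]
    # ...then an O(k log k) sort of just those k items, descending by score.
    order = top_k[np.argsort(y_score[top_k])[::-1]]
    # Fetch the relevance labels of the selected items and discount by rank.
    gains = y_true[order]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    return float(gains @ discounts)

def ndcg_at_k(y_true, y_score, k=100):
    """Normalize by the ideal DCG, i.e. ranking items by their true labels."""
    ideal = dcg_at_k(y_true, y_true, k)
    return dcg_at_k(y_true, y_score, k) / ideal if ideal > 0 else 0.0
```

One caveat: ties among predicted scores are broken arbitrarily here, whereas scikit-learn's ndcg_score averages gains over tied scores by default, so results can differ on tied inputs.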
My main concern isn't the theoretical asymptotic complexity (though I was interested to learn that); it's that I couldn't use the tool to do the job. I'd still be interested to see if the performance of ndcg_score can be improved, but in the meantime I've started writing my own ranking metrics library in order to find an implementation with acceptable cost.

It's super-cool that y'all noticed my post and created this issue, and I hope there's a worthwhile performance optimization here! Probably doesn't make that much difference for the average user, but might help the worst-case users (e.g. me). 😆