Direct access to all doc_ids
This is something I was expecting to be quite straightforward (or at least better documented in the API), but it doesn't seem to be. Say I want to gather all doc_ids from a given corpus (for instance, to use a random negative sampler at run time). Currently, this is what I do:
import ir_datasets

data = ir_datasets.load("msmarco-document/train")
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())
which is fine, but, from what I can tell, this triggers an iteration over all docs in the collection (and is also not very intuitive).
Is there a better way to achieve this?
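For reference, here is a minimal sketch of gathering the IDs by plain iteration instead of going through the private _handler attribute. It assumes only that each document exposes a doc_id attribute, as the documents yielded by docs_iter() in ir_datasets do. The helper names collect_doc_ids and sample_negatives are hypothetical, and the Doc tuple is a stand-in so the example runs without downloading a corpus. Note that this still walks the whole collection once, so it does not remove the cost described above.

```python
import random
from collections import namedtuple


def collect_doc_ids(docs):
    """Return the doc_id of every document in an iterable of documents."""
    return [doc.doc_id for doc in docs]


def sample_negatives(all_doc_ids, positives, k, rng=random):
    """Draw k distinct doc_ids outside `positives` by rejection sampling.

    Assumes all_doc_ids has no duplicates and positives is a subset of it.
    """
    positives = set(positives)
    if k > len(all_doc_ids) - len(positives):
        raise ValueError("not enough non-positive documents to sample from")
    negatives = set()
    while len(negatives) < k:
        candidate = rng.choice(all_doc_ids)
        if candidate not in positives:
            negatives.add(candidate)
    return list(negatives)


# Stand-in documents (hypothetical); with ir_datasets one would instead pass
# ir_datasets.load("msmarco-document/train").docs_iter() to collect_doc_ids.
Doc = namedtuple("Doc", ["doc_id", "text"])
docs = [Doc("D1", "first"), Doc("D2", "second"), Doc("D3", "third")]

all_doc_ids = collect_doc_ids(docs)
negatives = sample_negatives(all_doc_ids, positives={"D1"}, k=2)
```

Keeping the IDs in a list makes rng.choice a constant-time operation, which suits repeated sampling during training.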
Issue Analytics
- State:
- Created a year ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Hey @ArthurCamara, quick update on this. Over the past few months I've been working on an alternative file format to facilitate doc_id->idx and idx->doc_id lookups, iteration over doc_ids, etc. It also aims to ditch the searchsorted approach for doc_id->idx lookups in favor of an on-disk hash table, since the former requires doc_ids to be padded to the same length (adding considerable size to some lookups) and has an unfavourable access pattern on disk, which makes it a bit slow until everything is loaded into the cache.
Not sure when it'll be ready for primetime, but just letting you know that a solution to this is in the works.
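The trade-off described in this comment can be illustrated with a small in-memory sketch. This is not the ir_datasets file format, only an assumption-laden toy: the sorted approach is modelled with fixed-width, null-padded keys and a binary search (bisect standing in for searchsorted), and the on-disk hash table is modelled with an ordinary dict. All names here are hypothetical.

```python
from bisect import bisect_left

# Document IDs in collection order; the position is the idx we want back.
doc_ids = ["D7", "D1234567", "D42"]

# Sorted fixed-width approach: pad every ID to the length of the longest one.
width = max(len(d.encode()) for d in doc_ids)
records = sorted((d.encode().ljust(width, b"\0"), idx) for idx, d in enumerate(doc_ids))
keys = [key for key, _ in records]
idxs = [idx for _, idx in records]


def sorted_lookup(doc_id):
    """doc_id -> idx via binary search over the padded, sorted keys."""
    key = doc_id.encode().ljust(width, b"\0")
    pos = bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return idxs[pos]
    raise KeyError(doc_id)


# Hash-table approach: no padding needed, one probe per lookup on average.
hash_lookup = {d: idx for idx, d in enumerate(doc_ids)}

# Padding cost: bytes stored for keys with and without padding.
padded_size = width * len(doc_ids)
raw_size = sum(len(d.encode()) for d in doc_ids)
```

In this toy, one long ID forces every key to its width, which is the size overhead the comment mentions. A binary search also jumps between distant positions in the key array, which is one plausible reading of the unfavourable on-disk access pattern.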
That sounds awesome, @seanmacavaney. Thanks for letting me know!