Direct access to all doc_ids


This is something I was expecting to be quite straightforward (or at least better documented in the API), but it doesn’t seem to be. Say I want to gather all doc_ids from a given corpus (for instance, to use a random negative sampler at run time). Currently, this is what I do:

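import ir_datasets
data = ir_datasets.load("msmarco-document/train")
# Note: _handler is private by convention, so this chain may change between releases
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())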

which is fine, but, as far as I can tell, this triggers an iteration over all docs in the collection (and is also not very intuitive).

Is there a better way to achieve this?
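
For context, here is a minimal sketch of the use case described above: collecting doc_ids through a plain iterator and then drawing random negatives from them. The helper names `collect_doc_ids` and `sample_negatives` and the toy `Doc` records are hypothetical, introduced only for illustration. Note that this still makes one full pass over the collection, so it addresses the "not very intuitive" part of the question, not the iteration cost.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for corpus records; only the doc_id field matters here.
Doc = namedtuple("Doc", ["doc_id", "text"])
toy_docs = [Doc("D1", "a"), Doc("D2", "b"), Doc("D3", "c"), Doc("D4", "d")]


def collect_doc_ids(docs):
    """Return the doc_id of every document in an iterable, in iteration order."""
    return [doc.doc_id for doc in docs]


def sample_negatives(all_doc_ids, positive_ids, k, rng=random):
    """Draw k distinct doc_ids that are not in positive_ids (rejection sampling).

    Assumes all_doc_ids has no duplicates and positive_ids are all in the corpus.
    """
    positives = set(positive_ids)
    if k > len(all_doc_ids) - len(positives):
        raise ValueError("not enough non-positive documents to sample from")
    chosen = []
    seen = set()
    while len(chosen) < k:
        candidate = all_doc_ids[rng.randrange(len(all_doc_ids))]
        if candidate not in positives and candidate not in seen:
            seen.add(candidate)
            chosen.append(candidate)
    return chosen


all_ids = collect_doc_ids(toy_docs)
negatives = sample_negatives(all_ids, ["D2"], k=2, rng=random.Random(0))

# With ir_datasets, the same helper could be fed the public iterator instead:
#   dataset = ir_datasets.load("msmarco-document/train")
#   all_ids = collect_doc_ids(dataset.docs_iter())
```

Since the list of ids is built once, it could also be cached to disk after the first pass so that later runs skip the iteration entirely.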

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
seanmacavaney commented, Jul 26, 2022

Hey @ArthurCamara – quick update on this. Over the past few months I’ve been working on an alternative file format to facilitate doc_id->idx and idx->doc_id lookups, iteration over doc_ids, etc. It also aims to ditch the searchsorted approach for doc_id->idx lookups in favor of an on-disk hash table, since the former requires doc_ids to be padded to the same length (adding considerable size to some lookups) and has an unfavourable access pattern on disk, which makes it a bit slow until everything is loaded into the cache.

Not sure when it’ll be ready for primetime, but just letting you know that a solution to this is in the works.
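
To illustrate the trade-off described in the comment above, here is a rough in-memory sketch of the two doc_id->idx strategies. It is not the actual ir_datasets file format: the standard-library `bisect` module stands in for a searchsorted call, a Python `dict` stands in for an on-disk hash table, and all function names are hypothetical. It also assumes doc_ids never contain a NUL byte, since NUL is used as padding.

```python
import bisect


def build_sorted_lookup(doc_ids):
    """Pad every doc_id to the same byte width and sort, as binary search requires."""
    width = max(len(d.encode("utf-8")) for d in doc_ids)
    pairs = sorted(
        (d.encode("utf-8").ljust(width, b"\x00"), idx)
        for idx, d in enumerate(doc_ids)
    )
    keys = [key for key, _ in pairs]
    idxs = [idx for _, idx in pairs]
    return width, keys, idxs


def sorted_lookup(doc_id, width, keys, idxs):
    """doc_id -> idx via binary search over the padded, sorted keys."""
    key = doc_id.encode("utf-8").ljust(width, b"\x00")
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return idxs[pos]
    raise KeyError(doc_id)


def build_hash_lookup(doc_ids):
    """doc_id -> idx via a hash table; no padding is needed."""
    return {d: idx for idx, d in enumerate(doc_ids)}


doc_ids = ["D1", "D20", "D300000"]
width, keys, idxs = build_sorted_lookup(doc_ids)
hash_lookup = build_hash_lookup(doc_ids)

# Key storage: the padded layout spends `width` bytes per id, whatever its real length.
padded_bytes = width * len(doc_ids)
raw_bytes = sum(len(d.encode("utf-8")) for d in doc_ids)
```

In this toy corpus the padded keys take 21 bytes against 12 bytes of raw ids, which is the size overhead the comment refers to. The binary search also probes keys that are far apart in the sorted array, which is the kind of scattered access pattern that is slow on disk until the data is cached.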

0 reactions
ArthurCamara commented, Jul 26, 2022

That sounds awesome, @seanmacavaney. Thanks for letting me know!
