Direct access to all doc_ids

This is something I expected to be quite straightforward (or at least better documented in the API), but it doesn’t seem to be. Say I want to gather all doc_ids from a given corpus (for instance, to use a random negative sampler at run time). Currently, this is what I do:

import ir_datasets

data = ir_datasets.load("msmarco-document/train")
# Goes through the private _handler attribute to reach the doc store's lookup
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())

This works, but as far as I can tell it triggers an iteration over all docs in the collection, and it is not very intuitive either.

Is there a better way to achieve this?
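
For context, here is a minimal sketch of the use case described above: collect every doc_id once, then draw random negatives from that list. The helpers `collect_doc_ids` and `sample_negatives` and the `FakeDoc` stand-in are hypothetical names introduced for illustration only; they are not part of ir_datasets. The sketch assumes each document object exposes a `doc_id` attribute.

```python
import random
from collections import namedtuple
from typing import Iterable, List, Sequence, Set


def collect_doc_ids(docs: Iterable) -> List[str]:
    """Materialise the doc_id of every document yielded by `docs` (one full pass)."""
    return [doc.doc_id for doc in docs]


def sample_negatives(all_doc_ids: Sequence[str], positives: Set[str],
                     k: int, rng: random.Random) -> List[str]:
    """Draw k distinct doc_ids outside `positives` by rejection sampling.

    Assumes doc_ids are unique and that every positive appears in all_doc_ids.
    """
    if k > len(all_doc_ids) - len(positives):
        raise ValueError("not enough non-positive documents to sample from")
    chosen: List[str] = []
    seen: Set[str] = set()
    while len(chosen) < k:
        candidate = rng.choice(all_doc_ids)
        if candidate not in positives and candidate not in seen:
            seen.add(candidate)
            chosen.append(candidate)
    return chosen


# Stand-in corpus so the sketch runs without downloading a dataset.
FakeDoc = namedtuple("FakeDoc", ["doc_id", "text"])
corpus = [FakeDoc(f"D{i}", "some text") for i in range(10)]

all_ids = collect_doc_ids(corpus)
negatives = sample_negatives(all_ids, {"D1", "D2"}, k=3, rng=random.Random(0))
```

With a real dataset, the iterable would presumably be `data.docs_iter()`. Note that this still scans the whole collection once, so it addresses the readability concern rather than the cost; caching the resulting list would avoid repeating the scan.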


seanmacavaney commented, Jul 26, 2022

Hey @ArthurCamara – quick update on this. Over the past few months I’ve been working on an alternative file format to facilitate doc_id->idx and idx->doc_id lookups, iteration over doc_ids, etc. It also aims to ditch the searchsorted approach for doc_id->idx lookups in favor of an on-disk hash table, since the former requires doc_ids to be padded to the same length (adding considerable size to some lookups) and has an unfavourable access pattern on disk, which makes it a bit slow until everything is loaded into the cache.
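
To illustrate the trade-off described above, the following toy sketch contrasts the two lookup strategies in memory. It is not the library's implementation: the standard-library `bisect` module stands in for a searchsorted-style binary search over padded keys, a plain `dict` stands in for the on-disk hash table, and all function names are hypothetical.

```python
import bisect
from typing import Dict, List, Tuple


def build_sorted_lookup(doc_ids: List[str]) -> Tuple[List[str], List[int], int]:
    """Pad every doc_id to the same width and sort, as a binary search requires."""
    width = max(len(d) for d in doc_ids)
    pairs = sorted((d.ljust(width, "\0"), i) for i, d in enumerate(doc_ids))
    keys = [key for key, _ in pairs]
    idxs = [idx for _, idx in pairs]
    return keys, idxs, width


def lookup_sorted(keys: List[str], idxs: List[int], width: int, doc_id: str) -> int:
    """doc_id -> idx via binary search over the padded, sorted keys."""
    key = doc_id.ljust(width, "\0")
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return idxs[pos]
    raise KeyError(doc_id)


def build_hash_lookup(doc_ids: List[str]) -> Dict[str, int]:
    """doc_id -> idx via a hash table; no padding is needed."""
    return {d: i for i, d in enumerate(doc_ids)}


doc_ids = ["D10", "D2", "D300", "D1"]
keys, idxs, width = build_sorted_lookup(doc_ids)
table = build_hash_lookup(doc_ids)

# Characters spent on padding alone in the sorted variant.
padding_overhead = sum(width - len(d) for d in doc_ids)
```

Both structures return the same idx for a given doc_id. The sorted variant stores every key at the width of the longest one, which is the size overhead mentioned above, and each lookup probes several scattered positions before it finds a match.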

Not sure when it’ll be ready for primetime, but just letting you know that a solution to this is in the works.

ArthurCamara commented, Jul 26, 2022

That sounds awesome, @seanmacavaney. Thanks for letting me know!