download can silently fail?

Reported by @yuhaia on macOS Catalina 10.15.7 with Python 3.7, for the Anserini index download of macavaney:anserini-trec-robust04: https://github.com/allenai/ir_datasets/issues/52#issuecomment-814020499

The best place to start would be to simulate a failed download with the same OS, Python version, and dataset, and see whether I can reproduce it. If I can't, I’m not sure what to try next.

@yuhaia, if you have any more details that could help me get to the bottom of this problem, please let me know!
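As a rough illustration of the "simulate a failed download" idea, the sketch below writes a truncated copy of a payload to disk and shows how a checksum comparison turns that truncation into a loud error. This is a generic technique, not a description of how ir_datasets verifies its downloads; `sha256_of`, `verify_download`, and `simulate_truncated_download` are hypothetical helper names introduced only for this example.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise IOError if the file on disk does not match the expected hash."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise IOError(
            f"checksum mismatch for {path}: "
            f"expected {expected_sha256}, got {actual}"
        )


def simulate_truncated_download(payload, keep_bytes):
    """Write only the first keep_bytes of payload, as an interrupted
    transfer might, and return the path of the resulting file."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(payload[:keep_bytes])
    return path


if __name__ == "__main__":
    payload = b"index bytes " * 1000  # stand-in for the real index file
    expected = hashlib.sha256(payload).hexdigest()
    path = simulate_truncated_download(payload, keep_bytes=100)
    try:
        verify_download(path, expected)
    except IOError as err:
        print("detected:", err)
    finally:
        os.remove(path)
```

A check like this only helps if an expected hash is known ahead of time; without one, comparing the file size against the server's reported content length is a weaker alternative.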

Top GitHub Comments

seanmacavaney commented, Dec 5, 2021

The idea is to avoid creating the docstore unless it’s necessary, to reduce storage overhead. For a simple, in-sequence iteration over a corpus (a really common operation, e.g., for indexing), a docstore usually isn’t needed*. But in most datasets, you cannot efficiently jump ahead, so a docstore (containing document index offset info) is built when the user slices the iterator. In the case you show, it isn’t strictly necessary, since the slice doesn’t jump ahead and has no stride. But right now, the code doesn’t distinguish between different slicing behaviours, so any slice triggers the creation of a docstore.

* There are exceptions to this rule, particularly when iteration cannot be done efficiently. This is a decision that I’ve been making on a case-by-case basis. I’m totally open to changing the behaviour for robust04, given that it’s a bit expensive to parse the corpus and the docstore doesn’t add very much storage overhead (~1.2GB).
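To make the slicing behaviour described in this comment concrete, here is a minimal sketch of a corpus wrapper that builds its docstore lazily and skips it for slices that start at the beginning and have no stride. It is not ir_datasets' actual code: `LazyCorpus`, `_ensure_docstore`, and `docstore_builds` are hypothetical names, and an in-memory list stands in for the real on-disk docstore. It shows one way the distinction between slicing behaviours could be drawn.

```python
import itertools


class LazyCorpus:
    """Hypothetical corpus wrapper; an illustration only."""

    def __init__(self, docs):
        self._docs = docs          # stands in for an expensive-to-parse source
        self._docstore = None      # built only when random access is needed
        self.docstore_builds = 0   # counter so the behaviour can be observed

    def __iter__(self):
        # In-sequence iteration never needs the docstore.
        return iter(self._docs)

    def _ensure_docstore(self):
        # Assumption: a plain list plays the role of the on-disk docstore.
        if self._docstore is None:
            self._docstore = list(self._docs)
            self.docstore_builds += 1
        return self._docstore

    def __getitem__(self, key):
        if isinstance(key, slice):
            start = key.start or 0
            step = key.step or 1
            stop = key.stop
            # A prefix slice (no jump ahead, no stride) is just a short
            # sequential read, so the docstore can be skipped.
            if start == 0 and step == 1 and (stop is None or stop >= 0):
                return list(itertools.islice(iter(self), stop))
        # Everything else (offsets, strides, negative indices, single
        # lookups) falls back to the docstore.
        return self._ensure_docstore()[key]


corpus = LazyCorpus([f"doc{i}" for i in range(10)])
first_three = corpus[:3]   # sequential read, no docstore built
later_docs = corpus[5:8]   # jumps ahead, so the docstore is built once
```

The trade-off is the one raised in the comment: the prefix case stays cheap in storage, while any slice that jumps ahead still pays the one-time cost of building the docstore.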

There are probably other optimisations I could make to speed up the parsing of robust04 too, though, that would avoid the docstore overhead for simple iterations. The gzip-encoded version iterates about twice as fast as the .z-encoded files. But it’s still not super efficient, mostly because it’s using bs4 to handle xml-like tags and such. I’ve been working on some improvements for HTML parsing (#64), and that should be applicable here as well.

cakiki commented, Dec 5, 2021

The newest fix does indeed work!

root@32c432cd2d52:/ir_datasets# python -m test.integration.trec_robust04
trec-robust04 docs: 528155doc [00:04, 115252.55doc/s]
[INFO] [finished] trec-robust04 docs: [00:04] [528155doc] [115250.72doc/s]
[INFO] [starting] doc lookups by index
[INFO] [finished] doc lookups by index [15ms]
[INFO] [starting] doc lookups by doc_id
[INFO] [finished] doc lookups by doc_id [1ms]
trec-robust04 qrels: 311410qrel [00:01, 273996.57qrel/s]
[INFO] [finished] trec-robust04 qrels: [00:01] [311410qrel] [273985.93qrel/s]
trec-robust04/fold1 qrels: 62789qrel [00:01, 58505.60qrel/s]
[INFO] [finished] trec-robust04/fold1 qrels: [00:01] [62789qrel] [58503.19qrel/s]
trec-robust04/fold2 qrels: 63917qrel [00:01, 59287.29qrel/s]
[INFO] [finished] trec-robust04/fold2 qrels: [00:01] [63917qrel] [59284.97qrel/s]
trec-robust04/fold3 qrels: 62901qrel [00:01, 58342.43qrel/s]
[INFO] [finished] trec-robust04/fold3 qrels: [00:01] [62901qrel] [58339.83qrel/s]
trec-robust04/fold4 qrels: 57962qrel [00:01, 53938.70qrel/s]
[INFO] [finished] trec-robust04/fold4 qrels: [00:01] [57962qrel] [53936.20qrel/s]
trec-robust04/fold5 qrels: 63841qrel [00:01, 59561.41qrel/s]
[INFO] [finished] trec-robust04/fold5 qrels: [00:01] [63841qrel] [59559.17qrel/s]
trec-robust04 queries: 250query [00:00, 4486.67query/s]
[INFO] [finished] trec-robust04 queries: [00:00] [250query] [4482.17query/s]
trec-robust04/fold1 queries: 50query [00:00, 1461.77query/s]
[INFO] [finished] trec-robust04/fold1 queries: [00:00] [50query] [1459.63query/s]
trec-robust04/fold2 queries: 50query [00:00, 1505.39query/s]
[INFO] [finished] trec-robust04/fold2 queries: [00:00] [50query] [1503.21query/s]
trec-robust04/fold3 queries: 50query [00:00, 1540.91query/s]
[INFO] [finished] trec-robust04/fold3 queries: [00:00] [50query] [1538.66query/s]
trec-robust04/fold4 queries: 50query [00:00, 1527.09query/s]
[INFO] [finished] trec-robust04/fold4 queries: [00:00] [50query] [1524.08query/s]
trec-robust04/fold5 queries: 50query [00:00, 1512.74query/s]
[INFO] [finished] trec-robust04/fold5 queries: [00:00] [50query] [1510.66query/s]
.
----------------------------------------------------------------------
Ran 3 tests in 11.436s

OK