download can silently fail?


Reported by @yuhaia on macOS Catalina 10.15.7 with Python 3.7, for the Anserini index download of macavaney:anserini-trec-robust04. https://github.com/allenai/ir_datasets/issues/52#issuecomment-814020499

The best place to start would be to simulate a failed download on the same OS, Python version, and dataset and see whether I can reproduce it. If not, I’m not sure what to try next.
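As a starting point for that kind of simulation, here is a minimal sketch of how a truncated download could be made to fail loudly instead of silently. It is a generic size-and-checksum check, not how ir_datasets handles downloads; `verify_download`, `simulate_truncated_download`, and the expected size and hash arguments are all hypothetical names introduced for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_size, expected_sha256):
    """Raise rather than accept a missing, truncated, or corrupted file.

    Hypothetical helper: the expected values are assumed to be known
    ahead of time (e.g. published alongside the file).
    """
    if not os.path.exists(path):
        raise IOError(f"{path} is missing")
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        raise IOError(f"{path}: expected {expected_size} bytes, got {actual_size}")
    if sha256_of(path) != expected_sha256:
        raise IOError(f"{path}: checksum mismatch")


def simulate_truncated_download(payload, keep_bytes):
    """Write only the first keep_bytes of payload, as an interrupted transfer might."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(payload[:keep_bytes])
    return path
```

Calling `verify_download` on a file produced by `simulate_truncated_download(payload, len(payload) // 2)` should raise, while a full-length copy should pass. Whether the failure reported in this issue is a truncation at all is still an open question.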

@yuhaia, if there are any more details that could help me get to the bottom of this problem, please let me know!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

2 reactions
seanmacavaney commented, Dec 5, 2021

The idea is to avoid creating the docstore unless it’s necessary, to reduce storage overhead. For a simple, in-sequence iteration over a corpus (a really common operation, e.g., for indexing), a docstore usually isn’t needed*. But in most datasets, you cannot efficiently jump ahead, so a docstore (containing document index offset info) is built when the user slices the iterator. In the case you show, it isn’t strictly necessary since the slice doesn’t jump ahead and has no stride. But right now, the iterator doesn’t distinguish between different slicing behaviours to conditionally trigger the creation of a docstore.

* There are exceptions to this rule, particularly when iteration cannot be done efficiently. This is a decision that I’ve been making on a case-by-case basis. I’m totally open to changing the behaviour for robust04, given that it’s a bit expensive to parse the corpus and the docstore doesn’t add very much storage overhead (~1.2GB).
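To illustrate the slicing distinction described above, here is a rough sketch of how a prefix slice (no jump ahead, no stride) could be served by plain streaming, with the docstore built only for slices that need random access. This is not the ir_datasets implementation: `LazyDocs`, `needs_docstore`, and the in-memory list standing in for a docstore are assumptions made for the example.

```python
import itertools


def needs_docstore(s):
    """Return True if slice `s` requires jumping ahead, striding, or counting from the end."""
    starts_at_beginning = s.start in (None, 0)
    no_stride = s.step in (None, 1)
    forward_stop = s.stop is None or s.stop >= 0
    return not (starts_at_beginning and no_stride and forward_stop)


class LazyDocs:
    """Hypothetical corpus wrapper that builds its docstore only on demand."""

    def __init__(self, parse_corpus):
        # `parse_corpus` is a callable returning a fresh iterator over documents.
        self._parse_corpus = parse_corpus
        self.docstore = None  # stands in for an on-disk docstore

    def __iter__(self):
        # Plain in-sequence iteration never touches the docstore.
        return self._parse_corpus()

    def __getitem__(self, s):
        if not isinstance(s, slice):
            raise TypeError("this sketch only supports slices")
        if not needs_docstore(s):
            # Prefix slice: stream from the start and stop early.
            return list(itertools.islice(self._parse_corpus(), s.stop))
        if self.docstore is None:
            # Any other slice pays the one-time cost of building the docstore.
            self.docstore = list(self._parse_corpus())
        return self.docstore[s]
```

With this shape, `docs[:10]` would stream the first ten documents without building anything, while `docs[1000:2000:5]` would trigger the docstore. The trade-off is that a prefix slice re-parses the corpus from the beginning each time, which matters when parsing is expensive, as it is for robust04.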

There are probably other optimisations I could make to speed up the parsing of robust04 too, though, that would avoid the docstore overhead for simple iterations. The gzip-encoded version iterates about twice as fast as the .z-encoded files. But it’s still not super efficient, mostly because it’s using bs4 to handle xml-like tags and such. I’ve been working on some improvements for HTML parsing (#64), and that should be applicable here as well.

1 reaction
cakiki commented, Dec 5, 2021

The newest fix does indeed work!

root@32c432cd2d52:/ir_datasets# python -m test.integration.trec_robust04
trec-robust04 docs: 528155doc [00:04, 115252.55doc/s]
[INFO] [finished] trec-robust04 docs: [00:04] [528155doc] [115250.72doc/s]
[INFO] [starting] doc lookups by index
[INFO] [finished] doc lookups by index [15ms]
[INFO] [starting] doc lookups by doc_id
[INFO] [finished] doc lookups by doc_id [1ms]
trec-robust04 qrels: 311410qrel [00:01, 273996.57qrel/s]
[INFO] [finished] trec-robust04 qrels: [00:01] [311410qrel] [273985.93qrel/s]
trec-robust04/fold1 qrels: 62789qrel [00:01, 58505.60qrel/s]
[INFO] [finished] trec-robust04/fold1 qrels: [00:01] [62789qrel] [58503.19qrel/s]
trec-robust04/fold2 qrels: 63917qrel [00:01, 59287.29qrel/s]
[INFO] [finished] trec-robust04/fold2 qrels: [00:01] [63917qrel] [59284.97qrel/s]
trec-robust04/fold3 qrels: 62901qrel [00:01, 58342.43qrel/s]
[INFO] [finished] trec-robust04/fold3 qrels: [00:01] [62901qrel] [58339.83qrel/s]
trec-robust04/fold4 qrels: 57962qrel [00:01, 53938.70qrel/s]
[INFO] [finished] trec-robust04/fold4 qrels: [00:01] [57962qrel] [53936.20qrel/s]
trec-robust04/fold5 qrels: 63841qrel [00:01, 59561.41qrel/s]
[INFO] [finished] trec-robust04/fold5 qrels: [00:01] [63841qrel] [59559.17qrel/s]
trec-robust04 queries: 250query [00:00, 4486.67query/s]
[INFO] [finished] trec-robust04 queries: [00:00] [250query] [4482.17query/s]
trec-robust04/fold1 queries: 50query [00:00, 1461.77query/s]
[INFO] [finished] trec-robust04/fold1 queries: [00:00] [50query] [1459.63query/s]
trec-robust04/fold2 queries: 50query [00:00, 1505.39query/s]
[INFO] [finished] trec-robust04/fold2 queries: [00:00] [50query] [1503.21query/s]
trec-robust04/fold3 queries: 50query [00:00, 1540.91query/s]
[INFO] [finished] trec-robust04/fold3 queries: [00:00] [50query] [1538.66query/s]
trec-robust04/fold4 queries: 50query [00:00, 1527.09query/s]
[INFO] [finished] trec-robust04/fold4 queries: [00:00] [50query] [1524.08query/s]
trec-robust04/fold5 queries: 50query [00:00, 1512.74query/s]
[INFO] [finished] trec-robust04/fold5 queries: [00:00] [50query] [1510.66query/s]
.
----------------------------------------------------------------------
Ran 3 tests in 11.436s

OK


