Slow to add new URLs and a lack of verbose logging output

I am running ArchiveBox v0.4.21. The archive method section of my config is this:

[ARCHIVE_METHOD_TOGGLES]
SAVE_GIT = false
SAVE_MEDIA = false
SAVE_SINGLEFILE = false
SAVE_READABILITY = False

Here is the output of adding one new URL from my latest Pinboard JSON dump:

$ archivebox add < ~/library/conf/pinboard.json
[i] [2020-08-31 19:26:48] ArchiveBox v0.4.21: archivebox add
    > /home/pigmonkey/tmp/bookmarks

[+] [2020-08-31 19:28:39] Adding 871 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1598902119-import.txt
    > Parsed 871 URLs from input (Generic JSON)
    > Found 1 new URLs not already in index

[*] [2020-08-31 19:30:23] Writing 880 links to main index...
    √ /home/pigmonkey/tmp/bookmarks/index.sqlite3
    √ /home/pigmonkey/tmp/bookmarks/index.json
    √ /home/pigmonkey/tmp/bookmarks/index.html

[▶] [2020-08-31 19:30:29] Collecting content for 1 Snapshots in archive...

[+] [2020-08-31 19:30:29] "welcome the covid influencer - Culture Study"
    https://annehelen.substack.com/p/welcome-the-covid-influencer
    > ./archive/1598900405.0
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > archive_org

[√] [2020-08-31 19:32:40] Update of 1 pages complete (2.18 min)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors

    Hint: To view your archive index, open:
        /home/pigmonkey/tmp/bookmarks/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-31 19:34:26] Writing 880 links to main index...
    √ /home/pigmonkey/tmp/bookmarks/index.sqlite3
    √ /home/pigmonkey/tmp/bookmarks/index.json
    √ /home/pigmonkey/tmp/bookmarks/index.html

In total, my shell tells me this took 7 minutes and 43 seconds to complete, which seems exceptionally long for a single URL. There are a few things to note:

You can tell from the timestamps of the first two log entry blocks that there is an almost two-minute gap (19:26:48 to 19:28:39) between ArchiveBox starting up and actually beginning to add links. There is no indication of what ArchiveBox is doing during this time.
The time between the second log entry block (“Adding 871 links…”) and the third (“Writing 880 links…”) is another two minutes (19:28:39 to 19:30:23). Pretty much all of this time is spent parsing the 871 URLs from the input. During this time, ArchiveBox shows me a progress bar, so I do know it is doing something, unlike during the previous pause. But two minutes seems like a really long time to check which of 871 strings exist in a SQLite database. (I assume ArchiveBox has already parsed the URLs out of the input JSON at this point, since it has already printed the “Adding 871 links…” message.) Is ArchiveBox trying to do something more complicated here, or is this indicative of a problem?
The time between the log entry for the single new URL and the completion message is another two minutes (19:30:29 to 19:32:40). The individual archive methods all show a progress bar when they are running, and they all complete as fast as I would expect, so this is not a network issue. After the final (archive.org) archive method completes, I counted a pause of 1 minute and 46 seconds where ArchiveBox seems to just hang without giving me any indication of what it is doing.
The time between the last two log entries is yet another two minutes (19:32:40 to 19:34:26). What happens here is another mystery. After the hint output was printed, I counted 1 minute and 33 seconds where ArchiveBox seems to hang again, with no indication of what it is doing, before it prints the final “Writing 880 links…” output.
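
The gaps described above can be recomputed directly from the log timestamps. The following is a minimal Python sketch, not part of ArchiveBox: the event labels and the `gaps` helper are names made up for illustration, while the timestamps are copied from the log output.

```python
from datetime import datetime

# Timestamps copied from the log output; the labels are illustrative only.
LOG_EVENTS = [
    ("startup", "2020-08-31 19:26:48"),
    ("adding links", "2020-08-31 19:28:39"),
    ("writing main index", "2020-08-31 19:30:23"),
    ("collecting content", "2020-08-31 19:30:29"),
    ("update complete", "2020-08-31 19:32:40"),
    ("writing main index again", "2020-08-31 19:34:26"),
]


def gaps(events):
    """Return (from_label, to_label, seconds) for each consecutive pair of events."""
    parsed = [
        (label, datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
        for label, stamp in events
    ]
    return [
        (a_label, b_label, int((b_time - a_time).total_seconds()))
        for (a_label, a_time), (b_label, b_time) in zip(parsed, parsed[1:])
    ]


for start, end, seconds in gaps(LOG_EVENTS):
    print(f"{start} -> {end}: {seconds // 60}m {seconds % 60:02d}s")
```

The five gaps sum to 458 seconds (7 minutes 38 seconds) between the first and last logged timestamps.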

Overall, it seems like this process is slower than it should be. The lack of any logging output during the long pauses is exceptionally frustrating, since it makes it seem like the program has frozen and makes it difficult to debug what the problem may be. A flag to make the logging more verbose might be a good start to addressing this problem.
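
As an illustration of what such a flag could look like, here is a minimal sketch using Python's standard `argparse` and `logging` modules. It is not ArchiveBox's code or CLI: the `example-add` program name, the `-v`/`--verbose` flag, the `level_for` helper, and the log messages are all hypothetical.

```python
import argparse
import logging


def build_parser():
    # Hypothetical CLI for illustration; this is not the archivebox command.
    parser = argparse.ArgumentParser(prog="example-add")
    parser.add_argument(
        "-v", "--verbose",
        action="count",
        default=0,
        help="increase log detail (-v for INFO, -vv for DEBUG)",
    )
    return parser


def level_for(verbosity):
    """Map the number of -v flags to a logging level."""
    if verbosity >= 2:
        return logging.DEBUG
    if verbosity == 1:
        return logging.INFO
    return logging.WARNING


def main(argv=None):
    args = build_parser().parse_args(argv)
    logging.basicConfig(
        level=level_for(args.verbose),
        format="[%(asctime)s] %(levelname)s %(message)s",
    )
    log = logging.getLogger("example")
    # Hypothetical phase messages: the kind of output that would fill the silent pauses.
    log.info("checking parsed URLs against the index")
    log.debug("rewriting index files")
    return args.verbose
```

Calling `main(["-vv"])` would emit both messages, `main(["-v"])` only the INFO one, and `main([])` neither, so the default output stays unchanged.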

The slowness has kept me using the old pre-Django version of ArchiveBox, where it feels that the only performance limitation when adding new URLs is my network connection.

Top GitHub Comments

pigmonkey commented, Oct 8, 2020

That completely solves the problem.

When running the latest master, the add command takes an average of 3 minutes to add a single URL, despite it only taking 10-20 seconds to archive that URL. When using #502, the entire command only adds about a second on top of the archive time.

$ time archivebox add https://github.com/pirate/ArchiveBox/issues/461
[i] [2020-10-08 23:51:20] ArchiveBox v0.4.21: archivebox add https://github.com/pirate/ArchiveBox/issues/461
    > /home/pigmonkey/tmp/bookmarks

[+] [2020-10-08 23:51:20] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1602201080-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-10-08 23:51:20] Writing 1 links to main index...
    √ /home/pigmonkey/tmp/bookmarks/index.sqlite3

[▶] [2020-10-08 23:51:20] Collecting content for 1 Snapshots in archive...

[+] [2020-10-08 23:51:20] "github.com/pirate/ArchiveBox/issues/461"
    https://github.com/pirate/ArchiveBox/issues/461
    > ./archive/1602201080.524471
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > headers
      > archive_org

[√] [2020-10-08 23:51:34] Update of 1 pages complete (13.65 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors

    Hint: To view your archive index, open:
        /home/pigmonkey/tmp/bookmarks/index.html
    Or run the built-in webserver:
        archivebox server

real    0m14.607s
user    0m5.573s
sys     0m1.419s

Removing is also close to instant.

$ time archivebox remove --delete --yes https://github.com/pirate/ArchiveBox/issues/461
[i] [2020-10-08 23:52:40] ArchiveBox v0.4.21: archivebox remove --delete --yes https://github.com/pirate/ArchiveBox/issues/461
    > /home/pigmonkey/tmp/bookmarks

[*] Finding links in the archive index matching these exact patterns:
    https://github.com/pirate/ArchiveBox/issues/461

---------------------------------------------------------------------------------------------------
timestamp        | is_archived      | num_outputs      | url
"1602201080.524471" | true             | 0                | "https://github.com/pirate/ArchiveBox/issues/461"
---------------------------------------------------------------------------------------------------

[i] Found 1 matching URLs to remove.
    1 Links will be de-listed from the main index, and their archived content folders will be deleted from disk.
    (1 data folders with 0 archived files will be deleted!)

[√] Removed 1 out of 922 links from the archive index.
    Index now contains 921 links.

real    0m0.897s
user    0m0.742s
sys     0m0.152s

The add command still prints the hint suggesting that I load the HTML index. That should probably go away if that file is going to be out of date until I run archivebox list --html >! index.html.

cdvv7788 commented, Oct 7, 2020

Well, we are in the middle of a big refactor. The JSON and HTML indexes are deprecated, but not yet removed. At the end of every command they are rebuilt, which can still cause some delays. They will be completely removed from the process in a later iteration. I need to review the rm command, but the issue is probably the same. I will create a PR removing those bits so you can experiment further with the speed improvements. It is a good idea to document that in a decently sized archive.
