Slow to add new URLs and a lack of verbose logging output
I am running ArchiveBox v0.4.21. The archive method section of my config is this:
[ARCHIVE_METHOD_TOGGLES]
SAVE_GIT = false
SAVE_MEDIA = false
SAVE_SINGLEFILE = false
SAVE_READABILITY = False
Here is the output of adding one new URL from my latest Pinboard JSON dump:
$ archivebox add < ~/library/conf/pinboard.json
[i] [2020-08-31 19:26:48] ArchiveBox v0.4.21: archivebox add
> /home/pigmonkey/tmp/bookmarks
[+] [2020-08-31 19:28:39] Adding 871 links to index (crawl depth=0)...
> Saved verbatim input to sources/1598902119-import.txt
> Parsed 871 URLs from input (Generic JSON)
> Found 1 new URLs not already in index
[*] [2020-08-31 19:30:23] Writing 880 links to main index...
√ /home/pigmonkey/tmp/bookmarks/index.sqlite3
√ /home/pigmonkey/tmp/bookmarks/index.json
√ /home/pigmonkey/tmp/bookmarks/index.html
[▶] [2020-08-31 19:30:29] Collecting content for 1 Snapshots in archive...
[+] [2020-08-31 19:30:29] "welcome the covid influencer - Culture Study"
https://annehelen.substack.com/p/welcome-the-covid-influencer
> ./archive/1598900405.0
> favicon
> wget
> pdf
> screenshot
> dom
> archive_org
[√] [2020-08-31 19:32:40] Update of 1 pages complete (2.18 min)
- 0 links skipped
- 1 links updated
- 0 links had errors
Hint: To view your archive index, open:
/home/pigmonkey/tmp/bookmarks/index.html
Or run the built-in webserver:
archivebox server
[*] [2020-08-31 19:34:26] Writing 880 links to main index...
√ /home/pigmonkey/tmp/bookmarks/index.sqlite3
√ /home/pigmonkey/tmp/bookmarks/index.json
√ /home/pigmonkey/tmp/bookmarks/index.html
In total, my shell tells me this took 7 minutes and 43 seconds to complete, which seems exceptionally long for a single URL. There are a few things to note:
- You can tell from the timestamps of the first two log entry blocks that there is an almost two-minute gap (19:26:48 to 19:28:39) between ArchiveBox starting up and actually beginning to add links. There is no indication of what ArchiveBox is doing during this time.
- The time between the second log entry block (“Adding 871 links…”) and the third (“Writing 880 links…”) is another two minutes (19:28:39 to 19:30:23). Pretty much all of this time is spent parsing the 871 URLs from the input. During this time, ArchiveBox shows me a progress bar, so I do know it is doing something, unlike during the previous pause. But two minutes seems like a really long time to check which of 871 strings exist in a SQLite database. (I assume ArchiveBox has already parsed the URLs out of the input JSON at this point, since it has already printed the “Adding 871 links…” message.) Is ArchiveBox trying to do something more complicated here, or is this indicative of a problem?
- The time between the log entry for the single new URL and the completion message is another two minutes (19:30:29 to 19:32:40). The individual archive methods all show a progress bar when they are running, and they all complete as fast as I would expect, so this is not a network issue. After the final (archive.org) archive method completes, I counted a pause of 1 minute and 46 seconds where ArchiveBox seems to just hang without giving me any indication of what it is doing.
- The time between the last two log entries is yet another two minutes (19:32:40 to 19:34:26). What happens here is another mystery. After the hint output was printed, I counted 1 minute and 33 seconds where ArchiveBox seems to hang again, with no indication of what it is doing, before it prints the final “Writing 880 links…” output.
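For reference, the gaps described in the list above can be recomputed directly from the timestamps in the log output. The following is only a small sketch: the timestamps are copied from the log, while the short labels and the helper name gaps are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the log output above; the labels are shortened
# descriptions of each log block, not exact ArchiveBox messages.
LOG_BLOCKS = [
    ("archivebox add starts", "2020-08-31 19:26:48"),
    ("Adding 871 links to index", "2020-08-31 19:28:39"),
    ("Writing 880 links to main index", "2020-08-31 19:30:23"),
    ("Collecting content for 1 Snapshots", "2020-08-31 19:30:29"),
    ("Update of 1 pages complete", "2020-08-31 19:32:40"),
    ("Writing 880 links to main index (second time)", "2020-08-31 19:34:26"),
]


def gaps(blocks):
    """Return (from_label, to_label, seconds) for each pair of consecutive blocks."""
    parsed = [
        (label, datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
        for label, stamp in blocks
    ]
    return [
        (a_label, b_label, int((b_time - a_time).total_seconds()))
        for (a_label, a_time), (b_label, b_time) in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    for start, end, seconds in gaps(LOG_BLOCKS):
        print(f"{seconds // 60}m{seconds % 60:02d}s  {start} -> {end}")
```

The gaps only cover the time between logged timestamps, so their sum is slightly below the total reported by the shell, which also includes the final index write.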
Overall, it seems like this process is slower than it should be. The lack of any logging output during the long pauses is exceptionally frustrating, since it makes it seem like the program has frozen and makes it difficult to debug what the problem may be. A flag to make the logging more verbose might be a good start to addressing this problem.
The slowness has kept me using the old pre-Django version of ArchiveBox, where it feels that the only performance limitation when adding new URLs is my network connection.
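As a rough baseline for the second point above, here is a minimal sketch of checking which of 871 URLs are missing from a SQLite table. It does not reflect what ArchiveBox actually does internally: the snapshots table, its single url column, the example.com URLs, and the function names are all assumptions made up for this comparison.

```python
import sqlite3
import time


def build_demo_index(n_existing):
    """Create an in-memory database with a hypothetical 'snapshots' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE snapshots (url TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO snapshots (url) VALUES (?)",
        [(f"https://example.com/page/{i}",) for i in range(n_existing)],
    )
    conn.commit()
    return conn


def find_new_urls(conn, candidate_urls):
    """Return the candidates that are not yet in the table, in input order."""
    existing = {row[0] for row in conn.execute("SELECT url FROM snapshots")}
    return [url for url in candidate_urls if url not in existing]


if __name__ == "__main__":
    # 879 URLs already indexed, 871 parsed from the input, 1 of them new.
    conn = build_demo_index(879)
    candidates = [f"https://example.com/page/{i}" for i in range(870)]
    candidates.append("https://example.com/new-page")

    started = time.perf_counter()
    new_urls = find_new_urls(conn, candidates)
    elapsed = time.perf_counter() - started

    print(f"Found {len(new_urls)} new URLs in {elapsed:.4f} seconds")
```

If a plain lookup like this is much faster than the two minutes observed, that would suggest the time is going into something other than the membership check itself.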
Top GitHub Comments
That completely solves the problem.
When running the latest master, the add command takes an average of 3 minutes to add a single URL, despite it only taking 10-20 seconds to archive that URL. When using #502, the entire command only adds about a second on top of the archive time. Removing is also close to instant.
The add command still prints the hint suggesting that I load the HTML index. That should probably go away if that file is going to be out of date until I run archivebox list --html >! index.html.
Well, we are in the middle of a big refactor. The JSON and HTML indexes are deprecated, but not yet removed. At the end of every command, they are being rebuilt, which can still cause some delays. They will be completely removed from the process in a later iteration. I need to review the rm command, but the issue is probably the same. I will create a PR removing those bits so you can experiment further with the speed improvements. It is a good idea to document that in a decently sized archive.