wget Errors on latest master
Describe the bug
wget times out after 30 seconds on the latest build of the master branch. When the same wget command is run outside of ArchiveBox, wget works as expected.
Steps to reproduce
Steps to reproduce the behavior:
- Use the following .ArchiveBox.config options
# Example config file for ArchiveBox: The self-hosted internet archive.
# Copy this file to ~/.ArchiveBox.conf before editing it.
# Config file is in both Python and .env syntax (all strings must be quoted).
# For documentation, see:
# https://github.com/pirate/ArchiveBox/wiki/Configuration
################################################################################
## General Settings
################################################################################
OUTPUT_PERMISSIONS=644
ONLY_NEW=True
TIMEOUT=30
MEDIA_TIMEOUT=3600
#TEMPLATES_DIR="archivebox/templates"
FOOTER_INFO="Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests."
FETCH_TITLE=True
FETCH_FAVICON=True
FETCH_WGET=True
FETCH_WARC=True
FETCH_PDF=True
FETCH_SCREENSHOT=False
FETCH_DOM=True
FETCH_GIT=True
FETCH_MEDIA=True
SUBMIT_ARCHIVE_DOT_ORG=False
#CHECK_SSL_VALIDITY=True
FETCH_WGET_REQUISITES=True
RESOLUTION="1440,900"
WGET_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
HEADLESS_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
GIT_DOMAINS="github.com,bitbucket.org,gitlab.com"
#COOKIES_FILE="path/to/cookies.txt"
#CHROME_USER_DATA_DIR="~/.config/google-chrome/Default"
USE_COLOR=false
SHOW_PROGRESS=false
- Run ./archive: `echo "https://developer.apple.com/library/archive/technotes/tn2218/_index.html#//apple_ref/doc/uid/DTS40007625" | ./archive`
- See error
Screenshots or log output
wget Failed: TimeoutExpired
Command '/usr/local/bin/wget' timed out after 30 seconds
Run to see full output:
cd /Volumes/home/www/archive/1553194400.182;
/usr/local/bin/wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=unix --timeout=30 --warc-file=warc/1553194992 --page-requisites "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36" https://developer.apple.com/library/archive/technotes/tn2219/_index.html#//apple_ref/doc/uid/DTS10004624
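The wording of the error in the log above ("Command ... timed out after 30 seconds") matches the message format of Python's `subprocess.TimeoutExpired`. That suggests the 30 seconds is a wall-clock limit on the whole wget process, whereas wget's own `--timeout` flag only bounds individual DNS, connect, and read operations. The following is a minimal sketch of that behavior, assuming ArchiveBox launches wget through `subprocess.run` with a `timeout` argument (not confirmed by this report). The helper name `run_with_timeout` is hypothetical, and a sleeping Python child process stands in for a slow wget run.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout):
    """Run cmd with a wall-clock limit; return (status, elapsed_seconds)."""
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,  # limit on the total runtime of the child process
        )
        status = "ok"
    except subprocess.TimeoutExpired as err:
        # str(err) reads "Command '...' timed out after N seconds"
        status = f"TimeoutExpired: {err}"
    return status, time.monotonic() - start


# Hypothetical stand-in for a slow wget: a child process that sleeps 5 seconds.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]

status, elapsed = run_with_timeout(slow_cmd, timeout=1)
print(status)
print(f"elapsed: {elapsed:.1f}s")
```

Under this assumption, a page with many requisites or slow disk writes could keep every single network operation under 30 seconds and still exceed the overall limit.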
Software versions
(please complete the following information)
- OS: macOS 10.14
- ArchiveBox version: d798117
- Python version: Python 3.7.2
- Wget version: GNU Wget 1.19.5 built on darwin17.5.0.
Issue Analytics
- Created 5 years ago
- Comments:6 (6 by maintainers)
Top GitHub Comments
Update: So I checked my settings and it looks like my NAS was mounting using SMB 2 by default. I’ve since changed this to SMB 3 which should help with any disk I/O issues resulting from network latency.
I’m running ArchiveBox again on the same data set as before, and the issue seems to be resolved: the average archive time is 15 seconds per link, which is back to a fairly decent speed.
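One way to check whether disk I/O on a network mount is the bottleneck is to time small synchronous file writes in the archive folder and compare the result with a local folder. This is a rough sketch, not part of ArchiveBox; the function name `time_small_writes` and the probe file names are hypothetical, and the example only measures the system temp directory by default.

```python
import os
import tempfile
import time


def time_small_writes(directory, count=50, size=4096):
    """Return the average seconds needed to write and fsync one small file."""
    payload = b"x" * size
    # Probe files live in a temporary subfolder that is removed afterwards.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        start = time.perf_counter()
        for i in range(count):
            path = os.path.join(tmp, f"probe_{i}.bin")
            with open(path, "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())  # force the write through to the device
        elapsed = time.perf_counter() - start
    return elapsed / count


if __name__ == "__main__":
    # Add the NAS-mounted archive folder to this list to compare it
    # against local storage.
    for target in [tempfile.gettempdir()]:
        avg = time_small_writes(target)
        print(f"{target}: {avg * 1000:.2f} ms per file")
```

A large gap between the local folder and the mounted folder would be consistent with the SMB explanation above, though it does not rule out other causes.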
The issue somehow appears to have resolved itself, although wget does appear to have been severely slowed down by something in the commits between c79e1df and d798117, and I’m getting throughput of 1 URL archived every 30 or so seconds.