Bug: Fails to parse list of URLs txt file
Describe the bug
I can't seem to get archivebox to add any URLs from a simple txt file containing a newline-separated list of URLs. Based on the error message, it fails to parse the file. I may be doing something wrong.
Steps to reproduce
- Create a txt file with some URLs, e.g. /tmp/urls.txt containing:
https://www.example.com/
https://example.com/
- Run
archivebox add /tmp/urls.txt
Screenshots or log output
Here’s the output I get:
ross@xx> archivebox add /tmp/urls.txt /tmp/archivebox
[i] [2022-04-20 16:05:12] ArchiveBox v0.6.2: archivebox add /tmp/urls.txt
> /tmp/archivebox
[!] Warning: Missing 3 recommended dependencies
! SINGLEFILE_BINARY: single-file (unable to detect version)
Hint: To install all packages automatically run: archivebox setup
or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
! READABILITY_BINARY: readability-extractor (unable to detect version)
Hint: To install all packages automatically run: archivebox setup
or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
! MERCURY_BINARY: mercury-parser (unable to detect version)
Hint: To install all packages automatically run: archivebox setup
or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
[+] [2022-04-20 16:05:13] Adding 1 links to index (crawl depth=0)...
> Saved verbatim input to sources/1650470713-import.txt
0.0% (0/240sec)[X] Error while loading link! [1650470713.151664] /tmp/urls.txt "None"
> Parsed 0 URLs from input (Failed to parse)
> Found 0 new URLs not already in index
[*] [2022-04-20 16:05:13] Writing 0 links to main index...
√ ./index.sqlite3
ArchiveBox version
ArchiveBox v0.6.2
Cpython Linux Linux-5.17.1-arch1-1-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep
[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.6.2 valid /home/ross/.local/bin/archivebox
√ PYTHON_BINARY v3.10.4 valid /usr/bin/python3.10
√ DJANGO_BINARY v3.1.14 valid /home/ross/.local/lib/python3.10/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.82.0 valid /usr/bin/curl
√ WGET_BINARY v1.21.3 valid /usr/bin/wget
√ NODE_BINARY v17.9.0 valid /usr/bin/node
X SINGLEFILE_BINARY ? invalid single-file
X READABILITY_BINARY ? invalid readability-extractor
X MERCURY_BINARY ? invalid mercury-parser
√ GIT_BINARY v2.35.2 valid /usr/bin/git
√ YOUTUBEDL_BINARY v2021.12.17 valid /home/ross/.local/bin/youtube-dl
√ CHROME_BINARY v100.0.4896.88 valid /usr/bin/chromium
√ RIPGREP_BINARY v13.0.0 valid /usr/bin/rg
[i] Source-code locations:
√ PACKAGE_DIR 23 files valid /home/ross/.local/lib/python3.10/site-packages/archivebox
√ TEMPLATES_DIR 3 files valid /home/ross/.local/lib/python3.10/site-packages/archivebox/templates
- CUSTOM_TEMPLATES_DIR - disabled
[i] Secrets locations:
- CHROME_USER_DATA_DIR - disabled
- COOKIES_FILE - disabled
[i] Data locations:
√ OUTPUT_DIR 5 files valid /tmp/archivebox
√ SOURCES_DIR 3 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 0 files valid ./archive
√ CONFIG_FILE 81.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 204.0 KB valid ./index.sqlite3
[!] Warning: Missing 3 recommended dependencies
! SINGLEFILE_BINARY: single-file (unable to detect version)
Hint: To install all packages automatically run: archivebox setup
or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
! READABILITY_BINARY: readability-extractor (unable to detect version)
Hint: To install all packages automatically run: archivebox setup
or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
! MERCURY_BINARY: mercury-parser (unable to detect version)
Hint: To install all packages automatically run: archivebox setup
or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
Issue Analytics
- State:
- Created a year ago
- Comments:6 (6 by maintainers)
Top GitHub Comments
Ah sorry, I forgot I removed loading directly from a file path in a previous version because it conflicted with the new --depth=1 implementation!
I'll reopen and merge your original PR https://github.com/ArchiveBox/ArchiveBox/pull/967. For future reference, stdin redirection is indeed necessary, or passing --depth=1 /path/to/file.txt also works.

I've also tried this on a fresh Docker-image-based installation and it fails similarly:
/tmp/ff/urls.txt being the same simple file:
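
For anyone hitting the same error on v0.6.2, the two workarounds named in the maintainer comment (stdin redirection, or passing the path together with --depth=1) might look roughly like the sketch below. The file path and its two URLs are the ones from the report; the URLS_FILE variable is only an illustrative name, and the commands are assumed to be run from inside an initialized ArchiveBox data directory (/tmp/archivebox in the log above).

```shell
# Illustrative variable; /tmp/urls.txt is the path used in the report.
URLS_FILE="${URLS_FILE:-/tmp/urls.txt}"

# Recreate the two-line, newline-separated input file from the report.
printf '%s\n' 'https://www.example.com/' 'https://example.com/' > "$URLS_FILE"

# Only attempt the import when archivebox is actually on PATH.
if command -v archivebox >/dev/null 2>&1; then
    # Workaround 1: feed the file contents over stdin instead of
    # passing the file path as an argument.
    archivebox add < "$URLS_FILE"

    # Workaround 2 (per the maintainer comment): pass the path with --depth=1.
    # archivebox add --depth=1 "$URLS_FILE"
fi
```

The second form is left commented out so the same URLs are not submitted twice; per the comment above, either one should be enough on its own.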