Bug: Pocket `since` high-water-mark gets set even when indexing fails
Describe the bug
I’m setting up Pocket importing for the first time, which means importing a lot of old links, some of them on now-defunct websites. When one of them fails, the entire import fails, but the `since` value in `pocket_api.db` is still set. When I then try to re-import my Pocket feed, it only retrieves new items, leaving me with no URLs archived.
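The behavior I would expect can be sketched as follows. This is a minimal illustration and not ArchiveBox's actual code: `import_since`, the `state` dict, the `fetch` and `index` callables, and the `time_added` field are all hypothetical names chosen for the example. The point is only that the high-water mark is written after indexing succeeds, so a failure leaves it untouched.

```python
from typing import Callable, Dict, List


def import_since(
    state: Dict[str, int],
    fetch: Callable[[int], List[dict]],
    index: Callable[[List[dict]], None],
) -> int:
    """Fetch items newer than state['since'], index them, then advance the mark.

    All names here are hypothetical; `state` stands in for whatever
    persists the `since` value between runs.
    """
    since = state.get("since", 0)
    items = fetch(since)
    if not items:
        return 0

    # If index() raises, the assignment below never runs,
    # so the stored `since` value stays where it was.
    index(items)

    # Only after a successful index do we advance the high-water mark.
    state["since"] = max(item["time_added"] for item in items)
    return len(items)
```

With this ordering, re-running the import after a failure fetches the same items again instead of skipping them.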
Steps to reproduce
- Set up Pocket config, per #528
- Have a URL in Pocket on a domain that refuses connections or does not exist
- Import from Pocket:
    $ archivebox add --depth=1 pocket://myUserName
    [+] [2021-04-28 14:00:05] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1619618411-import.txt
        > Parsed 169 URLs from input (Pocket API)
    [*] Starting crawl of 169 sites 1 hop out from starting point
        > Downloading http://my-working-url.com/ contents
        > Saved verbatim input to sources/1619618411-crawl-my-working-url.com.txt
        > Parsed 12 URLs from input (Generic TXT)
        > Downloading http://my-defunct-url.com/ contents
    [!] Failed to download http://my-defunct-url.com/
        HTTPConnectionPool(host='my-defunct-url.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xb4dbf9b8>: Failed to establish a new connection: [Errno -2] Name or service not known'))
- Remove broken URL from Pocket
- Try importing again:
    $ archivebox add --depth=1 pocket://myUserName
    [+] [2021-04-28 14:39:35] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1619620775-import.txt
        0.1% (0/240sec)
    [X] No links found using Pocket API parser
        Hint: Try a different parser or double check the input?
        > Parsed 0 URLs from input (Pocket API)
        > Found 0 new URLs not already in index
    [*] [2021-04-28 14:39:35] Writing 0 links to main index...
        √ ./index.sqlite3
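The first run above aborts as soon as one crawled URL cannot be downloaded. One way to avoid that, sketched here under assumptions, is to record each failure and carry on with the remaining URLs. `crawl_all` and the `download` callable are hypothetical names for illustration and do not correspond to ArchiveBox internals.

```python
from typing import Callable, Iterable, List, Tuple


def crawl_all(
    urls: Iterable[str],
    download: Callable[[str], str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, Exception]]]:
    """Download every URL, collecting failures instead of aborting the batch."""
    succeeded: List[Tuple[str, str]] = []
    failed: List[Tuple[str, Exception]] = []
    for url in urls:
        try:
            succeeded.append((url, download(url)))
        except Exception as err:
            # Broad catch for brevity; a real implementation would
            # likely catch narrower network-related exception types.
            failed.append((url, err))
    return succeeded, failed
```

The caller can then report the failed URLs at the end of the run while still indexing everything that succeeded.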
ArchiveBox version
ArchiveBox v0.6.2
Cpython Linux Linux-5.4.79-v7l+-armv7l-with-glibc2.28 armv7l
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep
[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.6.2 valid /usr/local/bin/archivebox
√ PYTHON_BINARY v3.9.4 valid /usr/local/bin/python3.9
√ DJANGO_BINARY v3.1.8 valid /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.64.0 valid /usr/bin/curl
√ WGET_BINARY v1.20.1 valid /usr/bin/wget
√ NODE_BINARY v15.14.0 valid /usr/bin/node
√ SINGLEFILE_BINARY v0.3.16 valid /node/node_modules/single-file/cli/single-file
√ READABILITY_BINARY v0.0.2 valid /node/node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid /node/node_modules/@postlight/mercury-parser/cli.js
√ GIT_BINARY v2.20.1 valid /usr/bin/git
√ YOUTUBEDL_BINARY v2021.04.07 valid /usr/local/bin/youtube-dl
√ CHROME_BINARY v89.0.4389.114 valid /usr/bin/chromium
√ RIPGREP_BINARY v0.10.0 valid /usr/bin/rg
[i] Source-code locations:
√ PACKAGE_DIR 22 files valid /app/archivebox
√ TEMPLATES_DIR 3 files valid /app/archivebox/templates
- CUSTOM_TEMPLATES_DIR - disabled
[i] Secrets locations:
- CHROME_USER_DATA_DIR - disabled
- COOKIES_FILE - disabled
[i] Data locations:
√ OUTPUT_DIR 5 files valid /data
√ SOURCES_DIR 28 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 0 files valid ./archive
√ CONFIG_FILE 204.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 204.0 KB valid ./index.sqlite3
Top GitHub Comments
The URLs should still be added to your database, so if you need to download those URLs, you can run `archivebox update` to go through your db and do that process.
@pirate My understanding is this isn’t specific to the Pocket API implementation though, is it?
Ah yeah, you’re right, this is pocket api specific then.