Bug: Pocket `since` high-water-mark gets set even when indexing fails


Describe the bug

I’m setting up Pocket importing for the first time, which means importing a lot of old links, some of which point to now-defunct websites. When one of them fails, the entire import fails, but the since value in pocket_api.db is still set. When I then try to re-import my Pocket feed, it only retrieves new items, leaving me with no URLs archived.
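To illustrate the behavior I expected, here is a minimal sketch of an import that only advances the high-water mark after indexing succeeds. It is not ArchiveBox's actual code: `SinceStore`, `import_feed`, and the `fetch`/`index` callables are hypothetical names invented for this example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical types: a fetcher returns (urls, new_since) for a given mark;
# an indexer raises an exception when indexing fails.
Fetcher = Callable[[Optional[str]], Tuple[List[str], str]]
Indexer = Callable[[Iterable[str]], None]


class SinceStore:
    """In-memory stand-in for wherever the `since` value is persisted."""

    def __init__(self) -> None:
        self.value: Optional[str] = None


def import_feed(fetch: Fetcher, index: Indexer, store: SinceStore) -> int:
    """Fetch items newer than the stored mark; advance the mark only on success."""
    urls, new_since = fetch(store.value)
    index(urls)  # may raise; in that case the line below never runs
    store.value = new_since  # commit the high-water mark last
    return len(urls)
```

With this ordering, a failed import leaves the stored value untouched, so the next run retrieves the same items again instead of skipping them.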

Steps to reproduce

  1. Set up Pocket config, per #528
  2. Have a URL in Pocket on a domain that refuses connections or does not exist
  3. Import from Pocket:
    $ archivebox add --depth=1 pocket://myUserName
    [+] [2021-04-28 14:00:05] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1619618411-import.txt
        > Parsed 169 URLs from input (Pocket API)
    
    [*] Starting crawl of 169 sites 1 hop out from starting point
        > Downloading http://my-working-url.com/ contents
        > Saved verbatim input to sources/1619618411-crawl-my-working-url.com.txt
        > Parsed 12 URLs from input (Generic TXT)
        > Downloading http://my-defunct-url.com/ contents
    [!] Failed to download http://my-defunct-url.com/
    
         HTTPConnectionPool(host='my-defunct-url.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xb4dbf9b8>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    
  4. Remove broken URL from Pocket
  5. Try importing again:
    $ archivebox add --depth=1 pocket://myUserName
    [+] [2021-04-28 14:39:35] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1619620775-import.txt
                                                                                                                                   0.1% (0/240sec)
    [X] No links found using Pocket API parser
        Hint: Try a different parser or double check the input?
    
        > Parsed 0 URLs from input (Pocket API)
        > Found 0 new URLs not already in index
    
    [*] [2021-04-28 14:39:35] Writing 0 links to main index...
        √ ./index.sqlite3
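A possible workaround for the state shown in step 5 is to clear the stored since value so that the next import retrieves everything again. The sketch below assumes pocket_api.db is an INI-style file with one section per Pocket username and a `since` key. That layout is an assumption and is not confirmed anywhere in this report, so back up the file first. `clear_since` is a hypothetical helper, not an ArchiveBox command.

```python
import configparser
from pathlib import Path


def clear_since(db_path: Path, username: str) -> bool:
    """Remove the stored `since` for `username`; return True if one was removed.

    Assumes an INI-style file with a [username] section holding a `since` key.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names case-sensitive, as written
    parser.read(db_path)  # a missing file is silently treated as empty
    if not parser.has_section(username) or not parser.has_option(username, "since"):
        return False
    parser.remove_option(username, "since")
    with open(db_path, "w") as fh:
        parser.write(fh)
    return True
```

The path to pocket_api.db is passed in as a parameter because this report does not state where the file lives.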
    

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.4.79-v7l+-armv7l-with-glibc2.28 armv7l
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.4          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.8          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.07     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v89.0.4389.114  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /data
 √  SOURCES_DIR           28 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           0 files         valid     ./archive
 √  CONFIG_FILE           204.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3


Top GitHub Comments

mAAdhaTTah commented, Apr 28, 2021

The URLs should still be added to your database, so if you need to download them, you can run archivebox update to go through your database and do that.

@pirate My understanding is this isn’t specific to the Pocket API implementation though, is it?

pirate commented, Apr 28, 2021

Ah yeah, you’re right, this is Pocket API specific then.

