Shaarli RSS parsing falls back to full-text and imports unneeded URLs from metadata fields

It looks like Shaarli feeds are not being parsed correctly: markup is being included in the extracted links (much like ticket 134 for Pocket). Also, Shaarli detail and tag pages are being parsed as source links, which makes the import much slower and clutters the archive.

You can use the public shaarli demo to reproduce this.

There’s a demo (U: demo / PW: demo) running on https://demo.shaarli.org/.

Add whatever link to this instance

The Atom feed then looks like this, for example (with just one link; this is what is being parsed as the input file):

<?xml  version="1.0" encoding="UTF-8" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Shaarli demo (master)</title>
  <subtitle>Shaared links</subtitle>
  
    <updated>2019-01-30T06:06:01+00:00</updated>
  
  <link rel="self" href="https://demo.shaarli.org/?do=atom" />
  
  <author>
    <name>https://demo.shaarli.org/</name>
    <uri>https://demo.shaarli.org/</uri>
  </author>
  <id>https://demo.shaarli.org/</id>
  <generator>Shaarli</generator>
  
    <entry>
      <title>Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online</title>
      
        <link href="https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html" />
      
      <id>https://demo.shaarli.org/?cEV4vw</id>
      
        <published>2019-01-30T06:06:01+00:00</published>
        <updated>2019-01-30T06:06:01+00:00</updated>
      
      <content type="html" xml:lang="en"><![CDATA[<div class="markdown"><p>&#8212; <a href="https://demo.shaarli.org/?cEV4vw">Permalink</a></p></div>]]></content>
      
      
    </entry>
  
</feed>

Note that ArchiveBox wants to include 8 links from this:

Adding 8 new links from /data/sources/demo.shaarli.org-1548828643.txt to /data/index.json

Most likely because 8 strings starting with http:// or https:// were found (that’s just my speculation). However, the expected behaviour is that only the source link is parsed and added, not Shaarli detail pages like https://demo.shaarli.org/?cEV4vw, which contain nothing but the link to the source (again). IMO that doesn’t make sense. It’s even “worse” if a link has tags, because every tag then leads to another link to be crawled.
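To make the expected behaviour concrete, here is a minimal sketch (not ArchiveBox's actual parser) that reads the feed as Atom with Python's standard `xml.etree.ElementTree` and keeps only the `href` of each entry's link. The function name `entry_links` and the trimmed sample feed are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Atom namespace, as declared on the <feed> element in the report.
ATOM = "{http://www.w3.org/2005/Atom}"

# Trimmed copy of the feed from the report (XML declaration omitted).
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Shaarli demo (master)</title>
  <link rel="self" href="https://demo.shaarli.org/?do=atom" />
  <author>
    <name>https://demo.shaarli.org/</name>
    <uri>https://demo.shaarli.org/</uri>
  </author>
  <id>https://demo.shaarli.org/</id>
  <entry>
    <title>Aktuelle Trojaner-Welle</title>
    <link href="https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html" />
    <id>https://demo.shaarli.org/?cEV4vw</id>
    <content type="html"><![CDATA[<p><a href="https://demo.shaarli.org/?cEV4vw">Permalink</a></p>]]></content>
  </entry>
</feed>
"""


def entry_links(feed_xml):
    """Return the source URL of every <entry>, ignoring ids, authors and content."""
    root = ET.fromstring(feed_xml)
    urls = []
    for entry in root.iter(ATOM + "entry"):
        for link in entry.findall(ATOM + "link"):
            href = link.get("href")
            # A link without a rel attribute is treated as rel="alternate" in Atom.
            if href and link.get("rel", "alternate") == "alternate":
                urls.append(href)
    return urls


for url in entry_links(FEED):
    print(url)
```

With this approach the feed-level `self` link, the author fields, the `<id>` permalinks and the HTML inside `<content>` never become candidates, so only the heise.de article would be queued for this sample.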

Grab the Atom feed https://demo.shaarli.org/?do=atom and import it into ArchiveBox: docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
You will see that markup fragments end up in the parser:

root@NASi:/volume1/docker/ArchiveBox/ArchiveBox-master# docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
[*] [2019-01-30 06:10:43] Downloading https://demo.shaarli.org/?do=atom > /data/sources/demo.shaarli.org-1548828643.txt
[+] [2019-01-30 06:11:02] Adding 8 new links from /data/sources/demo.shaarli.org-1548828643.txt to /data/index.json
[√] [2019-01-30 06:11:18] Updated main index files:
    > /data/index.json
    > /data/index.html
[▶] [2019-01-30 06:11:18] Updating files for 8 links in archive...
[+] [2019-01-30 06:11:27] "Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online - Shaarli demo (master)"
    https://demo.shaarli.org/?cEV4vw
    > /data/archive/1548828660 (new)
      > favicon
      > wget
        Got wget response code 8:
          Total wall clock time: 5.1s
          Downloaded: 20 files, 1.1M in 0.7s (1.54 MB/s)
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828660;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828689 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/?cEV4vw
      > pdf
      > screenshot
      > dom
      > archive_org
      > git
      > media
      √ index.json
      √ index.html
[+] [2019-01-30 06:11:50] "Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online - Shaarli demo (master)"
    https://demo.shaarli.org/?cEV4vw</id>
    > /data/archive/1548828659 (new)
      > favicon
      > wget
        Got wget response code 8:
          Total wall clock time: 5.1s
          Downloaded: 20 files, 1.1M in 0.7s (1.54 MB/s)
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828659;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828710 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/?cEV4vw</id>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in query at index 32: https://demo.shaarli.org/?cEV4vw</id>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/?cEV4vw</id>
      > git
      > media
      √ index.json
      √ index.html
[+] [2019-01-30 06:12:10] "comments_outline_white"
    https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
    > /data/archive/1548828658 (new)
      > favicon
      > wget
        Got wget response code 4:
          Total wall clock time: 38s
          Downloaded: 128 files, 6.0M in 12s (502 KB/s)
        Some resources were skipped: Got an error from the server
        Run to see full output:
            cd /data/archive/1548828658;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828730 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
      > pdf
      > screenshot
      > dom
      > archive_org
      > git
      > media
        got youtubedl response code 1:
b'ERROR: Unable to extract container ID; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n'
        Failed: Exception Failed to download media
        Run to see full output:
            cd /data/archive/1548828658;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:06] "https://demo.shaarli.org/</id>"
    https://demo.shaarli.org/</id>
    > /data/archive/1548828657 (new)
      > favicon
      > wget
        Got wget response code 8:
          https://demo.shaarli.org/%3C/id%3E:
          2019-01-30 06:13:07 ERROR 404: Not Found.
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828657;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828786 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</id>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</id>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</id>
      > git
      > media
        got youtubedl response code 1:
b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</id>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
        Failed: Exception Failed to download media
        Run to see full output:
            cd /data/archive/1548828657;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</id>
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:16] "https://demo.shaarli.org/</uri>"
    https://demo.shaarli.org/</uri>
    > /data/archive/1548828656 (new)
      > favicon
      > wget
        Got wget response code 8:
          https://demo.shaarli.org/%3C/uri%3E:
          2019-01-30 06:13:17 ERROR 404: Not Found.
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828656;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828796 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</uri>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</uri>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</uri>
      > git
      > media
        got youtubedl response code 1:
b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</uri>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
        Failed: Exception Failed to download media
        Run to see full output:
            cd /data/archive/1548828656;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</uri>
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:25] "Shaarli demo (master)"
    https://demo.shaarli.org/?do=atom
    > /data/archive/1548828655 (new)
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > archive_org
      > git
      > media
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:36] "https://demo.shaarli.org/</name>"
    https://demo.shaarli.org/</name>
    > /data/archive/1548828655.0 (new)
      > favicon
      > wget
        Got wget response code 8:
          https://demo.shaarli.org/%3C/name%3E:
          2019-01-30 06:13:37 ERROR 404: Not Found.
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828655.0;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828816 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</name>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</name>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</name>
      > git
      > media
        got youtubedl response code 1:
b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</name>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
        Failed: Exception Failed to download media
        Run to see full output:
            cd /data/archive/1548828655.0;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</name>
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:45] "http://www.w3.org/2005/Atom"
    http://www.w3.org/2005/Atom
    > /data/archive/1548828644 (new)
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception LiveDocumentNotAvailableException: http://www.w3.org/2005/Atom: live document unavailable: java.net.SocketTimeoutException: Read timed out
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/http://www.w3.org/2005/Atom
      > git
      > media
      √ index.json
      √ index.html
[√] [2019-01-30 06:15:28] Update of 8 links complete (4.17 min)
    - 8 entries skipped
    - 41 entries updated
    - 15 errors

(note the </id> at the end of the links)
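The stray closing tags are what one would expect if URLs are pulled out of the raw feed text with a pattern that only stops at whitespace or quotes. The sketch below illustrates that with a hypothetical regex. It is not ArchiveBox's actual pattern, and `NAIVE_URL`, `naive_links` and the trimmed feed are assumptions made for illustration.

```python
import re

# Trimmed copy of the feed from the report, treated as plain text.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Shaarli demo (master)</title>
  <link rel="self" href="https://demo.shaarli.org/?do=atom" />
  <author>
    <name>https://demo.shaarli.org/</name>
    <uri>https://demo.shaarli.org/</uri>
  </author>
  <id>https://demo.shaarli.org/</id>
  <entry>
    <title>Aktuelle Trojaner-Welle</title>
    <link href="https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html" />
    <id>https://demo.shaarli.org/?cEV4vw</id>
    <content type="html"><![CDATA[<p><a href="https://demo.shaarli.org/?cEV4vw">Permalink</a></p>]]></content>
  </entry>
</feed>
"""

# Hypothetical full-text pattern: a URL runs until whitespace or a double quote.
# "<" is not excluded, so a closing tag directly after a URL is swallowed too.
NAIVE_URL = re.compile(r'https?://[^\s"]+')


def naive_links(text):
    """Return the distinct URL-like strings found by scanning the raw text."""
    return sorted(set(NAIVE_URL.findall(text)))


for candidate in naive_links(FEED):
    print(candidate)
```

On this sample the scan yields eight distinct strings, four of them with a closing tag glued on, which lines up with the eight links in the log above.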

Issue Analytics

State:
Created 5 years ago
Comments: 26 (12 by maintainers)

Top GitHub Comments

pirate commented, Apr 12, 2022

w3.org and purl.org are expected in full-text parsing mode (which it’s falling back to due to a bug), because they are linked to in the RSS even though the links aren’t visible. They won’t be archived multiple times, so I recommend leaving them for now and ignoring those entries.

I’ve re-opened the issue to track fixing it, PRs to fix are welcome.

mawmawmawm commented, Apr 1, 2019

Sorry for the late reply. I tried it 3 days ago and it was working fine, except for the wget issue mentioned in the other ticket.
