Intermittent network response dropping when building and executing inside docker
It seems like the Pocket RSS feeds are not being parsed correctly, and fragments of the XML/HTML tags are being included in the links. Here’s how to reproduce this:
docker-compose exec archivebox /bin/archive http://getpocket.com/users/*[redacted]/feed/all
I created a Pocket account with two links in it. The corresponding RSS feed that is downloaded looks like this:
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>My Reading List: Read and Unread</title>
<description>Items I've saved to read</description>
<link>http://readitlaterlist.com/users/*[redacted]/feed/all</link>
<atom:link href="http://readitlaterlist.com/users/*[redacted]/feed/all" rel="self" type="application/rss+xml" />
<item>
<title><![CDATA[Trump Agrees to Reopen Government for 3 Weeks in Surprise Retreat From Wall]]></title>
<category>Unread</category>
<link>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</link>
<guid>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</guid>
<pubDate>Fri, 25 Jan 2019 16:21:38 -0600</pubDate>
</item>
<item>
<title><![CDATA[Neue Passwort-Leaks: Insgesamt 2,2 Milliarden Accounts betroffen]]></title>
<category>Unread</category>
<link>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</link>
<guid>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
<pubDate>Fri, 25 Jan 2019 16:20:07 -0600</pubDate>
</item>
</channel>
</rss>
Instead of the two <link>s, the software now tries to pull in 10 links and seems to mess up the URLs:
[▶] [2019-01-25 22:30:05] Updating files for 10 links in archive...
[+] [2019-01-25 22:30:09] "https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>"
https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
> /data/archive/1548455383 (new)
> favicon
> wget
Got wget response code 8:
https://www.heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html%3c/guid%3e:
2019-01-25 22:30:12 ERROR 404: Not Found.
Some resources were skipped: 404 Not Found
Run to see full output:
cd /data/archive/1548455383;
wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548455410 --page-requisites --user-agent="ArchiveBox/544de6831 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
> pdf
> screenshot
> dom
> archive_org
Failed: Exception BadQueryException: Illegal character in path at index 110: https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
Run to see full output:
curl --location --head --max-time 60 --get https://web.archive.org/save/https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
> git
√ index.json
√ index.html
(Note the </guid> at the end of the URL that wget is trying to download.)
In the end, no links could be saved:
[√] [2019-01-25 22:35:50] Update of 10 links complete (5.75 min)
- 10 entries skipped
- 44 entries updated
- 16 errors
Latest stable version.
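For comparison, here is a minimal sketch of how the two item links could be pulled out of a feed like the one above with a real XML parser, so that closing tags such as </guid> can never end up inside a URL. This is not ArchiveBox's actual parser: the function name extract_item_links and the trimmed SAMPLE_FEED (with a placeholder user name) are assumptions made for illustration, and only Python's standard library is used.

```python
import xml.etree.ElementTree as ET
from typing import List

# Trimmed copy of the feed from this report; "example" is a placeholder user.
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>My Reading List: Read and Unread</title>
<link>http://readitlaterlist.com/users/example/feed/all</link>
<item>
<title><![CDATA[Trump Agrees to Reopen Government]]></title>
<link>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</link>
<guid>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</guid>
</item>
<item>
<title><![CDATA[Neue Passwort-Leaks]]></title>
<link>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</link>
<guid>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
</item>
</channel>
</rss>"""


def extract_item_links(feed_xml: str) -> List[str]:
    """Return the text of the <link> child of every <item> in an RSS 2.0 feed."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(feed_xml.encode("utf-8"))
    links = []
    for item in root.iter("item"):
        # Only the item's own <link> is read; the channel-level <link>
        # and the <guid> elements are ignored.
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


if __name__ == "__main__":
    for url in extract_item_links(SAMPLE_FEED):
        print(url)
```

Because the parser returns element text only, each URL ends where the element ends, and the feed above yields exactly two links rather than ten.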
Issue Analytics
- State:
- Created 5 years ago
- Comments:15 (7 by maintainers)
Top GitHub Comments
So… even more… I changed puppeteer for puppeteer-core (a version of Puppeteer that doesn’t download Chromium by default) in the Dockerfile, because we’re installing Chromium separately anyway. This at first failed as well:
There seems to be something going on either with my network connection or the npm servers. I tried again:
This finally did work. Not sure about the tarball errors.
Back to the original purpose of the ticket, Pocket feeds not being properly imported: I tried the same RSS feed, and this time my two links were parsed and downloaded correctly; screenshot, HTML, and PDF are confirmed and working.
Thanks again for your support and this project. Love it, and I think it’s very important. You might want to consider puppeteer-core.
OK, I did more digging. I edited the Dockerfile to include a RUN npm cache clean --force before the puppeteer installation (now step 8 instead of 7), but no luck there either:
I then reduced the Dockerfile to the bare minimum to see if that would give me any clue:
But still (this time it errored out at the same spot):
So the error must be within the npm package of puppeteer?!