Intermittent network response dropping when building and executing inside docker

It seems that Pocket RSS feeds are not being parsed correctly: fragments of the XML / HTML tags end up in the links. Here’s how to reproduce this:

docker-compose exec archivebox /bin/archive http://getpocket.com/users/*[redacted]/feed/all

I created a Pocket account with two links in it. The RSS feed that gets downloaded looks like this:

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    >

<channel>

<title>My Reading List: Read and Unread</title>
<description>Items I've saved to read</description>
<link>http://readitlaterlist.com/users/*[redacted]/feed/all</link>
<atom:link href="http://readitlaterlist.com/users/*[redacted]/feed/all" rel="self" type="application/rss+xml" />


<item>
<title><![CDATA[Trump Agrees to Reopen Government for 3 Weeks in Surprise Retreat From Wall]]></title>
<category>Unread</category>
<link>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</link>
<guid>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</guid>
<pubDate>Fri, 25 Jan 2019 16:21:38 -0600</pubDate>
</item>
<item>
<title><![CDATA[Neue Passwort-Leaks: Insgesamt 2,2 Milliarden Accounts betroffen]]></title>
<category>Unread</category>
<link>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</link>
<guid>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
<pubDate>Fri, 25 Jan 2019 16:20:07 -0600</pubDate>
</item>
</channel>

</rss>

Instead of the two <link> entries, the software now tries to pull in 10 links and mangles the URLs:

[▶] [2019-01-25 22:30:05] Updating files for 10 links in archive...
[+] [2019-01-25 22:30:09] "https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>"
    https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
    > /data/archive/1548455383 (new)
      > favicon
      > wget
        Got wget response code 8:
          https://www.heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html%3c/guid%3e:
          2019-01-25 22:30:12 ERROR 404: Not Found.
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548455383;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548455410 --page-requisites --user-agent="ArchiveBox/544de6831 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in path at index 110: https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
      > git
      √ index.json
      √ index.html

(Note the </guid> at the end of the URL that wget is trying to download.)
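
One possible reading of this output, offered as an assumption rather than a confirmed diagnosis: the feed text contains exactly ten `http(s)://` substrings (four namespace URIs, the channel `<link>` and `atom:link`, plus a `<link>` and a `<guid>` per item), which is what a plain-text URL regex would pick up, trailing markup included. The sketch below contrasts such a regex with an XML parse using Python's standard `xml.etree.ElementTree`. `FEED` is a trimmed copy of the feed above, and `naive_urls` and `item_links` are hypothetical helper names, not ArchiveBox functions.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the feed from the report (titles, categories and dates dropped).
FEED = """<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    >
<channel>
<link>http://readitlaterlist.com/users/*[redacted]/feed/all</link>
<atom:link href="http://readitlaterlist.com/users/*[redacted]/feed/all" rel="self" type="application/rss+xml" />
<item>
<link>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</link>
<guid>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</guid>
</item>
<item>
<link>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</link>
<guid>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
</item>
</channel>
</rss>
"""


def naive_urls(text):
    """Hypothetical fallback: every run of non-space characters after http(s)://."""
    return re.findall(r"https?://\S+", text)


def item_links(text):
    """Parse the feed as XML and return only the <link> of each <item>."""
    root = ET.fromstring(text.encode("utf-8"))  # root element is <rss>
    return [el.text.strip() for el in root.findall("./channel/item/link")]


naive = naive_urls(FEED)
links = item_links(FEED)

print(len(naive), "regex matches,", len(links), "item links")
for url in naive:
    print("  regex:", url)
for url in links:
    print("  xml:  ", url)
```

The regex matches carry trailing quotes and closing tags such as `</guid>`, while the XML parse returns only the two article URLs. Whether ArchiveBox actually fell back to text matching here is not established by the report.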

In the end, no links could be saved:

[√] [2019-01-25 22:35:50] Update of 10 links complete (5.75 min)
    - 10 entries skipped
    - 44 entries updated
    - 16 errors

I’m running the latest stable version.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

mawmawmawm commented, Jan 28, 2019 (1 reaction)

A further update: in the Dockerfile I swapped puppeteer for puppeteer-core (a version of Puppeteer that doesn’t download Chromium by default), since Chromium is installed separately anyway. At first this failed as well:

Step 7/15 : RUN npm i puppeteer-core
 ---> Running in 4cfe4c562904
npm ERR! code EPROTO
npm ERR! errno EPROTO
npm ERR! request to https://registry.npmjs.org/rimraf failed, reason: write EPROTO 139977009982336:error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure:../deps/openssl/openssl/ssl/record/rec_layer_s3.c:1407:SSL alert number 40
npm ERR!

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2019-01-28T05_08_25_248Z-debug.log
ERROR: Service 'archivebox' failed to build: The command '/bin/sh -c npm i puppeteer-core' returned a non-zero code: 1

Something seems to be going on with either my network connection or the npm servers. I tried again:

Step 7/15 : RUN npm i puppeteer-core
 ---> Running in e1b3a79eaf9c
npm WARN tarball tarball data for es6-promise@^4.0.3 (sha512-n6wvpdE43VFtJq+lUDYDBFUwV8TZbuGXLV4D6wKafg13ldznKsyEvatubnmUe31zcvelSzOHF+XbaT+Bl9ObDg==) seems to be corrupted. Trying one more time.
npm WARN tarball tarball data for puppeteer-core@latest (sha512-JTsJKCQdrk1RqEGZN3l2TyW7Rhy7GWRRzd3PftYyA3B35l0t0lLU+gdF7czemnpSVVMiAgHpM1Uk/iO6jLreMA==) seems to be corrupted. Trying one more time.

> puppeteer-core@1.11.0 install /node_modules/puppeteer-core
> node install.js

npm WARN saveError ENOENT: no such file or directory, open '/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/package.json'
npm WARN !invalid#1 No description
npm WARN !invalid#1 No repository field.
npm WARN !invalid#1 No README data
npm WARN !invalid#1 No license field.

+ puppeteer-core@1.11.0
added 43 packages from 22 contributors and audited 50 packages in 14.773s
found 0 vulnerabilities

 ---> 7538b1c16fbc

This time it finally worked. I’m not sure what caused the tarball errors.
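
On the tarball warnings: the `sha512-…` strings in the log are integrity values that npm compares against the bytes it actually downloaded, so a "seems to be corrupted" warning means the received tarball did not hash to the expected value. A minimal Python sketch of that kind of check follows. It assumes the Subresource Integrity layout (`sha512-` followed by the base64 of the raw digest); `sri_sha512` and `matches_integrity` are hypothetical names, the payload is made-up sample data, and none of this is npm's own code.

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Return an SRI-style string: 'sha512-' + base64 of the raw SHA-512 digest."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def matches_integrity(data: bytes, expected: str) -> bool:
    """True if the downloaded bytes hash to the expected integrity string."""
    return sri_sha512(data) == expected


payload = b"example tarball bytes"   # made-up stand-in for a package tarball
expected = sri_sha512(payload)       # what a registry would publish for it
truncated = payload[:-5]             # simulate a download cut short in transit

print(matches_integrity(payload, expected))    # intact download
print(matches_integrity(truncated, expected))  # damaged download
```

A download that is cut short or altered in transit fails this comparison even though the published package is fine, which would be consistent with the install succeeding on a later attempt.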

Back to the original purpose of this ticket, Pocket feeds not being imported properly: I tried the same RSS feed again, and this time my two links were parsed and downloaded correctly. The screenshot, HTML, and PDF outputs are confirmed working.

Thanks again for your support and for this project. I love it and I think it’s very important. You might want to consider puppeteer-core.

mawmawmawm commented, Jan 28, 2019 (1 reaction)

OK, I did more digging. I edited the Dockerfile to add a RUN npm cache clean --force step before the puppeteer installation (now step 8 instead of 7), but no luck there either:


 ---> 73357b1217dc
Removing intermediate container 84e268fc0b12
Step 6/16 : RUN chmod +x /usr/local/bin/dumb-init
 ---> Running in bd8430cbfbf9
 ---> dcaaf479c297
Removing intermediate container bd8430cbfbf9
Step 7/16 : RUN npm cache clean --force
 ---> Running in 7d6f8353ba49
npm WARN using --force I sure hope you know what you are doing.
 ---> 22fe375cd41a
Removing intermediate container 7d6f8353ba49
Step 8/16 : RUN npm i puppeteer
 ---> Running in 8a5d51af5bac
npm ERR! Unexpected end of JSON input while parsing near '...s/extract-zip":"^1.6.'

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2019-01-28T04_26_16_469Z-debug.log

I then reduced the Dockerfile to the bare minimum to see if that would give me any clue:

FROM node:11-slim
LABEL maintainer="Nick Sweeting <archivebox-git@sweeting.me>"

# RUN apt-get update \
#    && apt-get install -yq --no-install-recommends \
#        git wget curl youtube-dl gnupg2 libgconf-2-4 python3 python3-pip \
#    && rm -rf /var/lib/apt/lists/*

# Install latest chrome package and fonts to support major charsets (Chinese, Japanese, Arabic, Hebrew, Thai and a few others)
RUN apt-get update && apt-get install -y wget --no-install-recommends \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
    && apt-get update \
    && apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont \
      --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /src/*.deb

# It's a good idea to use dumb-init to help prevent zombie chrome processes.
#ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
#RUN chmod +x /usr/local/bin/dumb-init

# Do a npm clean
#RUN npm cache clean --force

# Install puppeteer so it's available in the container.
RUN npm i puppeteer

# Add user so we don't need --no-sandbox.
#RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
#    && mkdir -p /home/pptruser/Downloads \
#    && chown -R pptruser:pptruser /home/pptruser \
#    && chown -R pptruser:pptruser /node_modules

# Install the ArchiveBox repository and pip requirements
#RUN git clone https://github.com/pirate/ArchiveBox /home/pptruser/app \
#    && mkdir -p /data \
#    && chown -R pptruser:pptruser /data \
#    && ln -s /data /home/pptruser/app/archivebox/output \
#    && ln -s /home/pptruser/app/bin/archivebox /bin/archive \
#    && chown -R pptruser:pptruser /home/pptruser/app/archivebox
#    # && pip3 install -r /home/pptruser/app/archivebox/requirements.txt

VOLUME /data

ENV LANG=C.UTF-8 \
    LANGUAGE=en_US:en \
    LC_ALL=C.UTF-8 \
    PYTHONIOENCODING=UTF-8 \
    CHROME_SANDBOX=False \

But it still failed, erroring out at the same spot:

Step 4/10 : RUN npm i puppeteer
 ---> Running in cdcae8339d94
npm ERR! Unexpected end of JSON input while parsing near '...s/extract-zip":"^1.6.'

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2019-01-28T04_35_30_363Z-debug.log
ERROR: Service 'archivebox' failed to build: The command '/bin/sh -c npm i puppeteer' returned a non-zero code: 1

So the error must be in the puppeteer npm package itself?!
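
The message itself allows another explanation: "Unexpected end of JSON input" is what a JSON parser reports when a document stops before it is complete, which fits a registry response cut off in transit at least as well as a broken package. A small Python illustration, assuming a made-up metadata snippet (the `extract-zip` key echoes the log; the version numbers and the `try_parse` helper are invented for the example):

```python
import json

# Invented stand-in for a registry metadata response.
complete = '{"dependencies": {"extract-zip": "^1.6.6", "debug": "^4.1.0"}}'

# Cut the body right after '^1.6.' to mimic the position shown in the npm error.
cut = complete.index("^1.6.") + len("^1.6.")
truncated = complete[:cut]


def try_parse(body: str) -> str:
    """Report whether a response body is a complete JSON document."""
    try:
        json.loads(body)
        return "ok"
    except json.JSONDecodeError as exc:
        return f"failed at char {exc.pos}: {exc.msg}"


print(try_parse(complete))
print(try_parse(truncated))
```

The complete body parses and the truncated one does not, although both come from the same valid document. The failure therefore says nothing by itself about the package contents.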
