Bugfix: Exception thrown from wget extractor

Describe the bug

The Wget extractor is throwing an exception: Failed to archive link: ValueError: '/data/index.html' does not start with '/data/archive/1601694245.432877'. Per the traceback below, it appears to originate from wget_output_path in archivebox/extractors/wget.py.
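For anyone unfamiliar with this error, here is a minimal sketch of how pathlib raises it, independent of ArchiveBox. The paths come from the error message above, except the example.com subfolder, which is a made-up placeholder. The exact wording of the message varies between Python versions.

```python
from pathlib import PurePosixPath

# Snapshot folder named in the error message.
link_dir = PurePosixPath("/data/archive/1601694245.432877")

# Hypothetical file inside the snapshot folder (placeholder name).
inside = link_dir / "example.com" / "index.html"

# The file the exception complains about: it lives outside link_dir.
outside = PurePosixPath("/data/index.html")

print(inside.relative_to(link_dir))  # example.com/index.html

caught = None
try:
    outside.relative_to(link_dir)
except ValueError as err:
    # relative_to() refuses any path that is not located under link_dir.
    caught = err
    print(f"ValueError: {err}")
```

So the error means that the path being made relative is not located under the snapshot folder at all.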

Steps to reproduce

I am using archivebox add --update-all --depth 1 http://bookmarks?do=atom, where http://bookmarks?do=atom is my Shaarli instance. Essentially, I am archiving all my bookmarks.

The log output shows everything works fine for quite a while, then this exception occurs and the Docker container dies.

The folder referenced in the log output below is empty (/data/archive/1601694245.432877).

I believe the previously archived link is fine (https://blog.thefactual.com/what-are-the-best-nonpartisan-news-sources) and that the exception occurs on the next link; however, the log output doesn't show which link it is trying to archive. The /data/index.html file that the exception refers to appears to be the main index file, so I am not sure whether this happens at the very end, after all links have been indexed, when it tries to rebuild the main index.html file as a last step.

This is with a recently published Docker image nikisweeting/archivebox@sha256:ee4c84369b8620c53f7d5772b70ad86aa22ca71d3ae3648ef19b75ba2c14efaf. I was previously using nikisweeting/archivebox:0.4.21, and when I switched to the newer image, I did not run any init or migration steps before calling add again.

Thank you for your time and help, let me know if there is additional information I can provide that would be useful.

BTW, I saw the comments in this issue describing how the main index is in the process of being removed. It very well could be that this issue is just a consequence of running code that is mid-refactor.

Screenshots or log output

2020-10-04T08:29:38.902862648Z [+] [2020-10-04 08:29:38] "blog.thefactual.com/what-are-the-best-nonpartisan-news-sources"
2020-10-04T08:29:38.902867437Z     https://blog.thefactual.com/what-are-the-best-nonpartisan-news-sources
2020-10-04T08:29:38.90287951Z     > ./archive/1601694245.427045
2020-10-04T08:29:38.903008686Z       > title
2020-10-04T08:29:39.469988358Z       > favicon
2020-10-04T08:29:40.605976312Z       > wget
2020-10-04T08:29:56.790239512Z       > pdf
2020-10-04T08:30:03.981854219Z       > screenshot
2020-10-04T08:30:09.238869897Z       > dom
2020-10-04T08:30:14.230718869Z       > readability
2020-10-04T08:30:15.530399373Z       > mercury
2020-10-04T08:30:18.555948774Z       > headers
2020-10-04T08:30:19.447750335Z     ! Failed to archive link: ValueError: '/data/index.html' does not start with '/data/archive/1601694245.432877'
2020-10-04T08:30:19.447779591Z 
2020-10-04T08:30:19.454165324Z Traceback (most recent call last):
2020-10-04T08:30:19.454182587Z   File "/usr/local/bin/archivebox", line 33, in <module>
2020-10-04T08:30:19.45570422Z     sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
2020-10-04T08:30:19.455721833Z   File "/app/archivebox/cli/__init__.py", line 123, in main
2020-10-04T08:30:19.456645155Z     run_subcommand(
2020-10-04T08:30:19.456662698Z   File "/app/archivebox/cli/__init__.py", line 63, in run_subcommand
2020-10-04T08:30:19.456722362Z     module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
2020-10-04T08:30:19.45673672Z   File "/app/archivebox/cli/archivebox_add.py", line 78, in main
2020-10-04T08:30:19.456934758Z     add(
2020-10-04T08:30:19.456943715Z   File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.457280297Z     return func(*args, **kwargs)
2020-10-04T08:30:19.457289795Z   File "/app/archivebox/main.py", line 559, in add
2020-10-04T08:30:19.457937511Z     archive_links(all_links, overwrite=overwrite, out_dir=out_dir)
2020-10-04T08:30:19.457947921Z   File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.458028775Z     return func(*args, **kwargs)
2020-10-04T08:30:19.458039155Z   File "/app/archivebox/extractors/__init__.py", line 157, in archive_links
2020-10-04T08:30:19.458703713Z     archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
2020-10-04T08:30:19.458716236Z   File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.458788705Z     return func(*args, **kwargs)
2020-10-04T08:30:19.458797612Z   File "/app/archivebox/extractors/__init__.py", line 83, in archive_link
2020-10-04T08:30:19.458868156Z     write_link_details(link, out_dir=out_dir, skip_sql_index=skip_index)
2020-10-04T08:30:19.45887508Z   File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.458962987Z     return func(*args, **kwargs)
2020-10-04T08:30:19.458979369Z   File "/app/archivebox/index/__init__.py", line 350, in write_link_details
2020-10-04T08:30:19.459758926Z     write_json_link_details(link, out_dir=out_dir)
2020-10-04T08:30:19.459772251Z   File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.459850541Z     return func(*args, **kwargs)
2020-10-04T08:30:19.459865449Z   File "/app/archivebox/index/json.py", line 100, in write_json_link_details
2020-10-04T08:30:19.460144481Z     atomic_write(str(path), link._asdict(extended=True))
2020-10-04T08:30:19.460153118Z   File "/app/archivebox/index/schema.py", line 206, in _asdict
2020-10-04T08:30:19.460434244Z     'canonical': self.canonical_outputs(),
2020-10-04T08:30:19.460446568Z   File "/app/archivebox/index/schema.py", line 406, in canonical_outputs
2020-10-04T08:30:19.460610581Z     'wget_path': wget_output_path(self),
2020-10-04T08:30:19.460619568Z   File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.460691705Z     return func(*args, **kwargs)
2020-10-04T08:30:19.460699941Z   File "/app/archivebox/extractors/wget.py", line 182, in wget_output_path
2020-10-04T08:30:19.461027055Z     return str(html_files[0].relative_to(link.link_dir))
2020-10-04T08:30:19.461035802Z   File "/usr/local/lib/python3.8/pathlib.py", line 907, in relative_to
2020-10-04T08:30:19.46225102Z     raise ValueError("{!r} does not start with {!r}"
2020-10-04T08:30:19.462267782Z ValueError: '/data/index.html' does not start with '/data/archive/1601694245.432877'
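The last frames show relative_to() being called on the first entry of html_files. As a rough sketch only (this is not the actual ArchiveBox code and not the fix from any PR), a caller could skip candidates that fall outside the snapshot folder instead of crashing. The helper name first_relative_path and the example.com path are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def first_relative_path(
    candidates: Iterable[PurePosixPath], link_dir: PurePosixPath
) -> Optional[str]:
    """Return the first candidate expressed relative to link_dir.

    Candidates that are not located under link_dir are skipped.
    Returns None when no candidate qualifies.
    """
    for candidate in candidates:
        try:
            return str(candidate.relative_to(link_dir))
        except ValueError:
            # Candidate is outside link_dir (e.g. the main index); ignore it.
            continue
    return None


link_dir = PurePosixPath("/data/archive/1601694245.432877")
candidates = [
    PurePosixPath("/data/index.html"),           # outside the snapshot folder
    link_dir / "example.com" / "index.html",     # hypothetical wget output
]

found = first_relative_path(candidates, link_dir)
missing = first_relative_path([PurePosixPath("/data/index.html")], link_dir)
print(found)    # example.com/index.html
print(missing)  # None
```

Returning None here only hides the symptom. It would still be worth finding out why a file outside the snapshot folder ended up among the candidates in the first place.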

Software versions

Docker image nikisweeting/archivebox@sha256:ee4c84369b8620c53f7d5772b70ad86aa22ca71d3ae3648ef19b75ba2c14efaf

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

2 reactions
cdvv7788 commented, Oct 19, 2020

@jrruethe can you try with this branch? https://github.com/pirate/ArchiveBox/pull/502 You will need to build the image yourself from the repository root: docker build -t archivebox --no-cache . (the trailing dot is the build context, which docker build requires). The actual archiving should be faster in that branch, making testing easier.

1 reaction
jrruethe commented, Oct 22, 2020

No suffering at all, I have it running on a cronjob in the background and I don’t check on it too often, so I didn’t notice for a while.

Honestly, Archivebox handles the large archive pretty well, I’ve been impressed.

I haven’t had a chance to try out the #502 branch yet, I’ll see if I can get to it soon.

Thanks!
