Bugfix: Exception thrown from wget extractor
Describe the bug
The Wget extractor is throwing an exception, Failed to archive link: ValueError: '/data/index.html' does not start with '/data/archive/1601694245.432877', which appears to originate from this block of code.
Steps to reproduce
I am using archivebox add --update-all --depth 1 http://bookmarks?do=atom, where http://bookmarks?do=atom is my Shaarli instance. Essentially, I am archiving all my bookmarks.
The log output shows everything works fine for quite a while, then this exception occurs and the Docker container dies.
The folder referenced in the log output below (/data/archive/1601694245.432877) is empty.
I believe that the previously archived link (https://blog.thefactual.com/what-are-the-best-nonpartisan-news-sources) is fine and that the exception occurs on the next link; however, the log output doesn't show me which link it is trying to archive. The /data/index.html file that the exception refers to appears to be the main index file, so I am not sure whether this is happening at the very end, after all links have been indexed, when it tries to rebuild the main index.html file as a last step.
This is with a recently published Docker image, nikisweeting/archivebox@sha256:ee4c84369b8620c53f7d5772b70ad86aa22ca71d3ae3648ef19b75ba2c14efaf. I was previously using nikisweeting/archivebox:0.4.21, and when I switched to the newer image, I did not run any init or migration steps before calling add again.
Thank you for your time and help. Let me know if there is additional information I can provide that would be useful.
BTW, I saw the comments in this issue describing how the main index is in the process of being removed. It very well could be that this issue is just a consequence of running code that is mid-refactor.
Screenshots or log output
2020-10-04T08:29:38.902862648Z [+] [2020-10-04 08:29:38] "blog.thefactual.com/what-are-the-best-nonpartisan-news-sources"
2020-10-04T08:29:38.902867437Z https://blog.thefactual.com/what-are-the-best-nonpartisan-news-sources
2020-10-04T08:29:38.90287951Z > ./archive/1601694245.427045
2020-10-04T08:29:38.903008686Z > title
2020-10-04T08:29:39.469988358Z > favicon
2020-10-04T08:29:40.605976312Z > wget
2020-10-04T08:29:56.790239512Z > pdf
2020-10-04T08:30:03.981854219Z > screenshot
2020-10-04T08:30:09.238869897Z > dom
2020-10-04T08:30:14.230718869Z > readability
2020-10-04T08:30:15.530399373Z > mercury
2020-10-04T08:30:18.555948774Z > headers
2020-10-04T08:30:19.447750335Z ! Failed to archive link: ValueError: '/data/index.html' does not start with '/data/archive/1601694245.432877'
2020-10-04T08:30:19.447779591Z
2020-10-04T08:30:19.454165324Z Traceback (most recent call last):
2020-10-04T08:30:19.454182587Z File "/usr/local/bin/archivebox", line 33, in <module>
2020-10-04T08:30:19.45570422Z sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
2020-10-04T08:30:19.455721833Z File "/app/archivebox/cli/__init__.py", line 123, in main
2020-10-04T08:30:19.456645155Z run_subcommand(
2020-10-04T08:30:19.456662698Z File "/app/archivebox/cli/__init__.py", line 63, in run_subcommand
2020-10-04T08:30:19.456722362Z module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
2020-10-04T08:30:19.45673672Z File "/app/archivebox/cli/archivebox_add.py", line 78, in main
2020-10-04T08:30:19.456934758Z add(
2020-10-04T08:30:19.456943715Z File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.457280297Z return func(*args, **kwargs)
2020-10-04T08:30:19.457289795Z File "/app/archivebox/main.py", line 559, in add
2020-10-04T08:30:19.457937511Z archive_links(all_links, overwrite=overwrite, out_dir=out_dir)
2020-10-04T08:30:19.457947921Z File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.458028775Z return func(*args, **kwargs)
2020-10-04T08:30:19.458039155Z File "/app/archivebox/extractors/__init__.py", line 157, in archive_links
2020-10-04T08:30:19.458703713Z archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
2020-10-04T08:30:19.458716236Z File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.458788705Z return func(*args, **kwargs)
2020-10-04T08:30:19.458797612Z File "/app/archivebox/extractors/__init__.py", line 83, in archive_link
2020-10-04T08:30:19.458868156Z write_link_details(link, out_dir=out_dir, skip_sql_index=skip_index)
2020-10-04T08:30:19.45887508Z File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.458962987Z return func(*args, **kwargs)
2020-10-04T08:30:19.458979369Z File "/app/archivebox/index/__init__.py", line 350, in write_link_details
2020-10-04T08:30:19.459758926Z write_json_link_details(link, out_dir=out_dir)
2020-10-04T08:30:19.459772251Z File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.459850541Z return func(*args, **kwargs)
2020-10-04T08:30:19.459865449Z File "/app/archivebox/index/json.py", line 100, in write_json_link_details
2020-10-04T08:30:19.460144481Z atomic_write(str(path), link._asdict(extended=True))
2020-10-04T08:30:19.460153118Z File "/app/archivebox/index/schema.py", line 206, in _asdict
2020-10-04T08:30:19.460434244Z 'canonical': self.canonical_outputs(),
2020-10-04T08:30:19.460446568Z File "/app/archivebox/index/schema.py", line 406, in canonical_outputs
2020-10-04T08:30:19.460610581Z 'wget_path': wget_output_path(self),
2020-10-04T08:30:19.460619568Z File "/app/archivebox/util.py", line 113, in typechecked_function
2020-10-04T08:30:19.460691705Z return func(*args, **kwargs)
2020-10-04T08:30:19.460699941Z File "/app/archivebox/extractors/wget.py", line 182, in wget_output_path
2020-10-04T08:30:19.461027055Z return str(html_files[0].relative_to(link.link_dir))
2020-10-04T08:30:19.461035802Z File "/usr/local/lib/python3.8/pathlib.py", line 907, in relative_to
2020-10-04T08:30:19.46225102Z raise ValueError("{!r} does not start with {!r}"
2020-10-04T08:30:19.462267782Z ValueError: '/data/index.html' does not start with '/data/archive/1601694245.432877'
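The last frames of the traceback show the error coming from pathlib's relative_to, which raises ValueError whenever the path is not located inside the given base directory. The following is a minimal sketch that reproduces this behavior with the two paths from the log and shows one possible guard. The helper name safe_relative_path and the example.com subfolder are hypothetical illustrations, not part of ArchiveBox, and this is not necessarily how the project fixed the issue.

```python
from pathlib import PurePosixPath
from typing import Optional


def safe_relative_path(candidate: PurePosixPath, base: PurePosixPath) -> Optional[str]:
    """Hypothetical helper: return candidate relative to base,
    or None when candidate does not live inside base."""
    try:
        return str(candidate.relative_to(base))
    except ValueError:
        # relative_to raises ValueError for paths outside of base
        return None


# Paths taken from the log output above
link_dir = PurePosixPath('/data/archive/1601694245.432877')
outside = PurePosixPath('/data/index.html')

# Assumed example of a file that does live inside the link folder
inside = link_dir / 'example.com' / 'index.html'

# Calling relative_to directly reproduces the failure from the traceback
try:
    outside.relative_to(link_dir)
except ValueError as err:
    print('ValueError:', err)

print(safe_relative_path(outside, link_dir))  # None
print(safe_relative_path(inside, link_dir))   # example.com/index.html
```

A guard like this would only hide the symptom. It does not explain why /data/index.html was picked up as a wget output candidate for this link in the first place.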
Software versions
Docker image nikisweeting/archivebox@sha256:ee4c84369b8620c53f7d5772b70ad86aa22ca71d3ae3648ef19b75ba2c14efaf
Issue Analytics
- Created 3 years ago
- Comments: 11 (5 by maintainers)
Top GitHub Comments
@jrruethe can you try with this branch? https://github.com/pirate/ArchiveBox/pull/502 You will need to build the image yourself:
docker build -t archivebox --no-cache
The actual archiving should be faster in that branch, making testing easier.
No suffering at all. I have it running on a cronjob in the background and I don't check on it too often, so I didn't notice for a while.
Honestly, ArchiveBox handles the large archive pretty well; I've been impressed.
I haven't had a chance to try out the #502 branch yet. I'll see if I can get to it soon.
Thanks!