Bugfixes: Large crawls eventually crash during json loading/dumping
Describe the bug
This is yet another 0.4.1 bug, feel free to close it but do notice that I can’t upgrade either. 😉
Steps to reproduce
- Run ArchiveBox with around 10,000 URLs to crawl
- Wait around 3 hours
- The crawl eventually crashes with:
TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
Screenshots or log output
Full backtrace:
[+] [2019-05-07 03:38:34] "www.varnish-software.com/blog/introducing-varnish-massive-storage-engine"
https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine
> ./archive/1557189662.144
> title
Failed:
HTTPError HTTP Error 404: Not Found
Run to see full output:
cd /srv/backup/archive/archivebox/archive/1557189662.144;
curl https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine | grep <title
> favicon
> wget
Failed:
Got an error from the server
Got wget response code: 8.
https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine:
2019-05-07 03:38:35 erreur 404 : Not Found.
Run to see full output:
cd /srv/backup/archive/archivebox/archive/1557189662.144;
wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --warc-file=warc/1557200315 --page-requisites "--user-agent=ArchiveBox/0.4.1 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine
> pdf
> screenshot
> dom
> media
! Failed to archive link: TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
Traceback (most recent call last):
File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 10, in <module>
sys.exit(main())
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/__main__.py", line 10, in main
archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox.py", line 58, in main
pwd=pwd or OUTPUT_DIR,
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox_add.py", line 55, in main
out_dir=pwd or OUTPUT_DIR,
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/main.py", line 521, in add
archive_link(link, out_dir=link.link_dir)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/extractors/__init__.py", line 84, in archive_link
write_link_details(link, out_dir=link.link_dir)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/__init__.py", line 345, in write_link_details
write_json_link_details(link, out_dir=out_dir)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/json.py", line 89, in write_json_link_details
atomic_write(link._asdict(extended=True), path)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/system.py", line 73, in atomic_write
pyjson.dump(contents, f, indent=4, sort_keys=True, cls=ExtendedEncoder)
File "/usr/lib/python3.7/json/__init__.py", line 179, in dump
for chunk in iterable:
File "/usr/lib/python3.7/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/usr/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.7/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/usr/lib/python3.7/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 250, in default
return obj._asdict()
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/schema.py", line 36, in _asdict
return asdict(self)
File "/usr/lib/python3.7/dataclasses.py", line 1044, in asdict
return _asdict_inner(obj, dict_factory)
File "/usr/lib/python3.7/dataclasses.py", line 1051, in _asdict_inner
value = _asdict_inner(getattr(obj, f.name), dict_factory)
File "/usr/lib/python3.7/dataclasses.py", line 1085, in _asdict_inner
return copy.deepcopy(obj)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 274, in _reconstruct
y = func(*args)
TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
Command exited with non-zero status 1
2466.01user 359.38system 2:57:42elapsed 26%CPU (0avgtext+0avgdata 258044maxresident)k
316056inputs+48205712outputs (1181major+28465380minor)pagefaults 0swaps
"time archivebox add wallabag.list " took 2 hours 57 mins 43 secs
I have tried upgrading archivebox to the django branch but then it fails with:
$ time archivebox add --update-all wallabag.list
> ./sources/wallabag.list-1557227903.txt
[*] [2019-05-07 11:18:24] Parsing new links from output/sources/wallabag.list-1557227903.txt...
> Parsed 10317 links as Plain Text (0 new links added)
[*] [2019-05-07 11:19:02] Writing 10262 links to main index...
Traceback (most recent call last):
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 383, in execute
return Database.Cursor.execute(self, query, params)
sqlite3.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 11, in <module>
load_entry_point('archivebox', 'console_scripts', 'archivebox')()
File "/home/anarcat/dist/ArchiveBox/archivebox/__main__.py", line 10, in main
archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
File "/home/anarcat/dist/ArchiveBox/archivebox/cli/archivebox.py", line 58, in main
pwd=pwd or OUTPUT_DIR,
File "/home/anarcat/dist/ArchiveBox/archivebox/cli/__init__.py", line 55, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/home/anarcat/dist/ArchiveBox/archivebox/cli/archivebox_add.py", line 55, in main
out_dir=pwd or OUTPUT_DIR,
File "/home/anarcat/dist/ArchiveBox/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/dist/ArchiveBox/archivebox/main.py", line 509, in add
write_main_index(links=all_links, out_dir=out_dir)
File "/home/anarcat/dist/ArchiveBox/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/dist/ArchiveBox/archivebox/index/__init__.py", line 233, in write_main_index
write_sql_main_index(links, out_dir=out_dir)
File "/home/anarcat/dist/ArchiveBox/archivebox/util.py", line 104, in typechecked_function
return func(*args, **kwargs)
File "/home/anarcat/dist/ArchiveBox/archivebox/index/sql.py", line 37, in write_sql_main_index
Snapshot.objects.create(**info)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/query.py", line 422, in create
obj.save(force_insert=True, using=self.db)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 741, in save
force_update=force_update, update_fields=update_fields)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 779, in save_base
force_update, using, update_fields,
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 870, in _save_table
result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 908, in _do_insert
using=using, raw=raw)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/query.py", line 1186, in _insert
return query.get_compiler(using=using).execute_sql(return_id)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1332, in execute_sql
cursor.execute(sql, params)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 67, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 383, in execute
return Database.Cursor.execute(self, query, params)
django.db.utils.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp
I suspect the database structure has changed but it’s not immediately obvious to me how to fix that…
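For reference, the IntegrityError above means that a row was inserted with a timestamp that already exists in core_snapshot. The following is a minimal sketch, not ArchiveBox code: the table layout and both helper names are assumptions made up for illustration. It shows how duplicate timestamps in a list of links trigger the same SQLite error, and how they could be spotted before writing the index.

```python
import sqlite3
from collections import Counter


def duplicate_timestamps(timestamps):
    """Return the sorted timestamps that occur more than once."""
    counts = Counter(timestamps)
    return sorted(ts for ts, n in counts.items() if n > 1)


def try_insert(timestamps):
    """Insert timestamps into a throwaway in-memory table.

    The table is a hypothetical stand-in with a UNIQUE timestamp column,
    not the real ArchiveBox schema. Returns the error message if the
    UNIQUE constraint is violated, otherwise None.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE core_snapshot (timestamp TEXT UNIQUE)")
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO core_snapshot (timestamp) VALUES (?)",
                [(ts,) for ts in timestamps],
            )
    except sqlite3.IntegrityError as exc:
        return str(exc)
    finally:
        conn.close()
    return None


if __name__ == "__main__":
    stamps = ["1557189662.144", "1557189662.145", "1557189662.144"]
    print(duplicate_timestamps(stamps))
    print(try_insert(stamps))
```

If a check like this reports duplicates in an existing index, those entries would need distinct timestamps before the SQL index can be written; how ArchiveBox itself should resolve them is not something this sketch answers.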
Software versions
- OS: Debian buster 10
- ArchiveBox version: django branch, installed through pip -e in a virtualenv
- Python version: 3.7.3rc3?
- Chrome version: N/A
Issue Analytics
- State:
- Created 4 years ago
- Comments:14 (6 by maintainers)
Top GitHub Comments
Sorry for the long delay @anarcat, I’m still swamped by my day job. I’m going to try to get to this in the next couple of months, but it may be tricky with upcoming travel and client meetings. Whatever you do, don’t scrap that archive: it’s 100% recoverable. I’m sure there’s a simple fix I can add for this in v0.4, I just need a solid block of time to figure it out.
I was running into the exact same problem (tested both v.0.4.2 and v.0.4.3 branches) yesterday and noticed that the type error (below) occurs when a link couldn’t be processed (e.g. 404).
tldr: The output field in the class ArchiveResult must always (I guess) contain a string value. In case of an error it holds an instance of the error object, which in turn makes the deepcopy operation at the end of the JSON serialization throw the type error.
Solution: in archivebox/extractors/title.py (line 62), change the value of output from err to str(err).
I don’t know if I overlooked something else, but this appears to fix the error.
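To illustrate the suggested fix, here is a minimal sketch. The Result dataclass is a hypothetical two-field stand-in for ArchiveResult, and as_output is an invented helper, not something from the ArchiveBox codebase. The idea is only that the exception is converted to a string before it is stored, so dataclasses.asdict() and json.dumps() never have to copy an exception object.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.error import HTTPError


@dataclass
class Result:
    # Hypothetical stand-in for ArchiveResult; the real class has more fields.
    status: str
    output: Optional[str]


def as_output(value):
    """Turn exceptions into plain strings; pass other values through unchanged."""
    if isinstance(value, BaseException):
        return str(value)
    return value


if __name__ == "__main__":
    # Simulate the failure case from the log: the title fetch hit a 404.
    err = HTTPError("https://example.com/missing", 404, "Not Found", None, None)
    result = Result(status="failed", output=as_output(err))
    # asdict() now only sees strings, so serialization goes through.
    print(json.dumps(asdict(result), indent=4, sort_keys=True))
```

Storing str(err) loses the exception type and traceback, so if those matter for debugging they would have to be logged separately.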