Bugfixes: Large crawls eventually crash during json loading/dumping

Describe the bug

This is yet another 0.4.1 bug, feel free to close it but do notice that I can’t upgrade either. 😉

Steps to reproduce

  1. Run ArchiveBox with around 10,000 URLs to crawl
  2. Wait around 3 hours
  3. Crawl eventually crashes with: TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'

Screenshots or log output

Full backtrace:

[+] [2019-05-07 03:38:34] "www.varnish-software.com/blog/introducing-varnish-massive-storage-engine"
    https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine
    > ./archive/1557189662.144
      > title
        Failed:
            HTTPError HTTP Error 404: Not Found
        Run to see full output:
            cd /srv/backup/archive/archivebox/archive/1557189662.144;
            curl https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine | grep <title

      > favicon
      > wget
        Failed:
             Got an error from the server
            Got wget response code: 8.
            https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine:
            2019-05-07 03:38:35 erreur 404 : Not Found.
        Run to see full output:
            cd /srv/backup/archive/archivebox/archive/1557189662.144;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --warc-file=warc/1557200315 --page-requisites "--user-agent=ArchiveBox/0.4.1 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto https://www.varnish-software.com/blog/introducing-varnish-massive-storage-engine

      > pdf
      > screenshot
      > dom
      > media
    ! Failed to archive link: TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
                                     
Traceback (most recent call last):   
  File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 10, in <module>
    sys.exit(main())                 
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/__main__.py", line 10, in main
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox.py", line 58, in main
    pwd=pwd or OUTPUT_DIR,           
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox_add.py", line 55, in main
    out_dir=pwd or OUTPUT_DIR,       
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/main.py", line 521, in add
    archive_link(link, out_dir=link.link_dir)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/extractors/__init__.py", line 84, in archive_link
    write_link_details(link, out_dir=link.link_dir)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/__init__.py", line 345, in write_link_details
    write_json_link_details(link, out_dir=out_dir)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/json.py", line 89, in write_json_link_details
    atomic_write(link._asdict(extended=True), path)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/system.py", line 73, in atomic_write
    pyjson.dump(contents, f, indent=4, sort_keys=True, cls=ExtendedEncoder)
  File "/usr/lib/python3.7/json/__init__.py", line 179, in dump
    for chunk in iterable:           
  File "/usr/lib/python3.7/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks                
  File "/usr/lib/python3.7/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks                
  File "/usr/lib/python3.7/json/encoder.py", line 325, in _iterencode_list
    yield from chunks                
  File "/usr/lib/python3.7/json/encoder.py", line 438, in _iterencode
    o = _default(o)                  
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 250, in default
    return obj._asdict()             
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/schema.py", line 36, in _asdict
    return asdict(self)              
  File "/usr/lib/python3.7/dataclasses.py", line 1044, in asdict
    return _asdict_inner(obj, dict_factory)
  File "/usr/lib/python3.7/dataclasses.py", line 1051, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.7/dataclasses.py", line 1085, in _asdict_inner
    return copy.deepcopy(obj)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 274, in _reconstruct
    y = func(*args)
TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
Command exited with non-zero status 1
2466.01user 359.38system 2:57:42elapsed 26%CPU (0avgtext+0avgdata 258044maxresident)k
316056inputs+48205712outputs (1181major+28465380minor)pagefaults 0swaps
"time archivebox add wallabag.list " took 2 hours 57 mins 43 secs

I have tried upgrading archivebox to the django branch but then it fails with:

$ time archivebox add --update-all wallabag.list
    > ./sources/wallabag.list-1557227903.txt

[*] [2019-05-07 11:18:24] Parsing new links from output/sources/wallabag.list-1557227903.txt...
    > Parsed 10317 links as Plain Text (0 new links added)

[*] [2019-05-07 11:19:02] Writing 10262 links to main index...
Traceback (most recent call last):
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 383, in execute
    return Database.Cursor.execute(self, query, params)
sqlite3.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 11, in <module>
    load_entry_point('archivebox', 'console_scripts', 'archivebox')()
  File "/home/anarcat/dist/ArchiveBox/archivebox/__main__.py", line 10, in main
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
  File "/home/anarcat/dist/ArchiveBox/archivebox/cli/archivebox.py", line 58, in main
    pwd=pwd or OUTPUT_DIR,
  File "/home/anarcat/dist/ArchiveBox/archivebox/cli/__init__.py", line 55, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/anarcat/dist/ArchiveBox/archivebox/cli/archivebox_add.py", line 55, in main
    out_dir=pwd or OUTPUT_DIR,
  File "/home/anarcat/dist/ArchiveBox/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)
  File "/home/anarcat/dist/ArchiveBox/archivebox/main.py", line 509, in add
    write_main_index(links=all_links, out_dir=out_dir)
  File "/home/anarcat/dist/ArchiveBox/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)
  File "/home/anarcat/dist/ArchiveBox/archivebox/index/__init__.py", line 233, in write_main_index
    write_sql_main_index(links, out_dir=out_dir)
  File "/home/anarcat/dist/ArchiveBox/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)
  File "/home/anarcat/dist/ArchiveBox/archivebox/index/sql.py", line 37, in write_sql_main_index
    Snapshot.objects.create(**info)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/query.py", line 422, in create
    obj.save(force_insert=True, using=self.db)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 741, in save
    force_update=force_update, update_fields=update_fields)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 779, in save_base
    force_update, using, update_fields,
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 870, in _save_table
    result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/base.py", line 908, in _do_insert
    using=using, raw=raw)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/query.py", line 1186, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1332, in execute_sql
    cursor.execute(sql, params)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py", line 383, in execute
    return Database.Cursor.execute(self, query, params)
django.db.utils.IntegrityError: UNIQUE constraint failed: core_snapshot.timestamp

I suspect the database structure has changed but it’s not immediately obvious to me how to fix that…
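
The thread does not confirm what triggered this second failure. One possibility, stated here purely as an assumption, is that a row with the same timestamp already exists when the index is rewritten, or that two links in the list share a timestamp. The sketch below uses a hypothetical `snapshot` table and made-up links (not ArchiveBox's real schema or data) to show how a UNIQUE column produces this kind of `sqlite3.IntegrityError`, and how duplicates could be spotted before inserting.

```python
import sqlite3
from collections import Counter


def duplicate_timestamps(links):
    """Return the timestamps that appear more than once in `links`."""
    counts = Counter(link["timestamp"] for link in links)
    return sorted(ts for ts, n in counts.items() if n > 1)


# Made-up links for illustration; two of them share a timestamp.
links = [
    {"url": "https://example.com/a", "timestamp": "1557189662.144"},
    {"url": "https://example.com/b", "timestamp": "1557189662.144"},
    {"url": "https://example.com/c", "timestamp": "1557189663.0"},
]

# Hypothetical table with a UNIQUE timestamp column (an assumption that only
# mirrors the constraint named in the error message above).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot (url TEXT, timestamp TEXT UNIQUE)")

insert_error = None
try:
    for link in links:
        conn.execute(
            "INSERT INTO snapshot (url, timestamp) VALUES (?, ?)",
            (link["url"], link["timestamp"]),
        )
except sqlite3.IntegrityError as exc:
    # The second insert reuses a timestamp, so SQLite rejects it.
    insert_error = str(exc)

print("duplicates:", duplicate_timestamps(links))
print("insert error:", insert_error)
conn.close()
```

Checking for duplicates up front only narrows down the cause; whether the real index needs deduplication or a schema migration is not something this sketch can answer.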

Software versions

  • OS: Debian buster 10
  • ArchiveBox version: django branch, installed through pip -e in a virtualenv
  • Python version: 3.7.3rc3?
  • Chrome version: N/A

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

pirate commented, Jul 9, 2019 (2 reactions)

Sorry for the long delay, @anarcat. I’m still swamped by my day job. I’m going to try to get to this in the next couple of months, but it may be tricky with upcoming travel and client meetings. Whatever you do, don’t scrap that archive: it’s 100% recoverable. I’m sure there’s a simple fix I can add for this in v0.4; I just need a solid block of time to figure it out.

dvpc commented, May 16, 2020 (1 reaction)

I was running into the exact same problem (tested both v.0.4.2 and v.0.4.3 branches) yesterday and noticed that the type error (below) occurs when a link couldn’t be processed (e.g. 404).

...
  File "/usr/lib/python3.7/dataclasses.py", line 1085, in _asdict_inner
    return copy.deepcopy(obj)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/copy.py", line 274, in _reconstruct
    y = func(*args)
TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'

tl;dr: The output field of the ArchiveResult class must (I guess) always contain a string value. When an extractor fails, it instead holds the exception instance, which makes the deepcopy call inside dataclasses.asdict(), run during JSON serialization, throw the TypeError.

Solution: in archivebox/extractors/title.py (line 62), change the value of output from err to str(err).

def save_title(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> ArchiveResult:
...
    except Exception as err:
        status = 'failed'
        output = str(err)  # store the message string, not the exception object
    finally:
        timer.end()
...

I don’t know if I overlooked something else, but this appears to fix the error.
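
The failure mode described above can be reproduced outside ArchiveBox. The following is a minimal sketch, not ArchiveBox code: `Result` is a hypothetical stand-in for ArchiveResult, and `try_asdict` is a helper invented for illustration. It relies on the behaviour visible in the traceback, namely that `dataclasses.asdict()` deep-copies field values, and that deep-copying an `HTTPError` tries to rebuild it without its constructor arguments.

```python
import io
from dataclasses import asdict, dataclass
from typing import Any
from urllib.error import HTTPError


@dataclass
class Result:
    """Hypothetical stand-in for ArchiveBox's ArchiveResult."""
    status: str
    output: Any


def try_asdict(result: Result):
    """Return asdict(result), or the TypeError message if deep-copying fails."""
    try:
        return asdict(result)
    except TypeError as exc:
        return f"TypeError: {exc}"


# The same kind of error the title extractor hit on a 404 page.
err = HTTPError("https://example.com/missing", 404, "Not Found", {}, io.BytesIO())

# Storing the exception object: asdict() deep-copies the field value, and the
# copy is expected to fail when HTTPError is re-created with no arguments.
broken = try_asdict(Result(status="failed", output=err))

# Storing the message string instead, as suggested in the comment above.
fixed = try_asdict(Result(status="failed", output=str(err)))

print(broken)
print(fixed)
```

Converting the exception to a string at the point where it is caught keeps the dataclass trivially serializable, at the cost of losing the original exception object for later inspection.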
