Bug: Indexing subtitles in media extractor fails when they're not UTF-8 encoded

I get the following when archiving a link to a YouTube video:

[+] [2022-05-27 13:56:21] "youtu.be/_ZNCCttVMg8?t=1592"
    https://youtu.be/_ZNCCttVMg8?t=1592
    > ./archive/1653659773.131913
      > title
      > favicon
      > headers
      > singlefile
      > screenshot
      > wget
      > readability
      > mercury
      > media
Traceback (most recent call last):
  File "/app/archivebox/extractors/__init__.py", line 109, in archive_link
    ! Failed to archive link: Exception: Exception in archive_methods.save_media(Link(url=https://youtu.be/_ZNCCttVMg8?t=1592))

    result = method_function(link=link, out_dir=out_dir)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/media.py", line 75, in save_media
    index_texts = [
  File "/app/archivebox/extractors/media.py", line 76, in <listcomp>
    text_file.read_text(encoding='utf-8').strip()
  File "/usr/local/lib/python3.10/pathlib.py", line 1133, in read_text
    return f.read()
  File "/usr/local/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 87545: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_add.py", line 103, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 626, in add
    archive_links(new_links, overwrite=False, **archive_kwargs)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/__init__.py", line 181, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/__init__.py", line 130, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_media(Link(url=https://youtu.be/_ZNCCttVMg8?t=1592))

When this happens it stops processing the rest of the URLs I provided.
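
The traceback shows `text_file.read_text(encoding='utf-8')` raising on byte 0x96, which is a UTF-8 continuation byte and so cannot start a character (in Windows-1252 the same byte is an en dash). Below is a minimal sketch of a more tolerant read, assuming it is acceptable for search indexing to substitute undecodable bytes rather than fail. `read_text_lenient` is a hypothetical helper name used for illustration; it is not ArchiveBox code and not necessarily how the project fixed this.

```python
from pathlib import Path


def read_text_lenient(path: Path) -> str:
    """Read a text file as UTF-8 without raising on invalid bytes.

    Hypothetical helper for illustration only (not part of ArchiveBox).
    Bytes that are not valid UTF-8 are replaced with U+FFFD, so one badly
    encoded subtitle file cannot abort the whole run.
    """
    return path.read_text(encoding="utf-8", errors="replace").strip()
```

The trade-off is that the replaced characters are lost in the indexed text; the file on disk is left untouched.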

ArchiveBox version

ArchiveBox v0.6.3
Cpython Linux Linux-5.4.0-113-generic-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.10.4         valid     /usr/local/bin/python3.10                                                   
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py          
 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2022.04.08     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v100.0.4896.127  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           24 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /data                                                                       
 √  SOURCES_DIR           4 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           35 files        valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             672.0 KB        valid     ./index.sqlite3

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

2 reactions
turian commented, Sep 15, 2022

Probably still a month or two out. I’m currently trying to find new housing in Oakland and that’s taking up all my free time.

Ach damn 😦 I lived in the Bay Area. I feel your pain. Have you considered moving to Berlin?

Might try and secure a $20-50k grant to work on ArchiveBox full-time in the near future! Will keep y’all posted, sorry for the brutal delay with this release, I know it’s taking a lot longer than usual and I know that has real impact on everyone’s workflows.

Well, as a Berliner you could apply for an EU grant. Somehow memex got one even though they are for-profit now. It seems like a cool project but they refuse to implement bulk export. Their sponsors:

[image]

If you ping me later I might have other ideas for sponsors.

Can I please ask for a tiny request?

Since I’m a new contributor, can you please enable access so that GitHub Actions CI/CD will run on my PRs?

Besides my larger PRs on yt-dlp (which I know you are too busy to review since they require some thought), I have this tiny one to fix everyone’s migration complaint about dev: https://github.com/ArchiveBox/ArchiveBox/pull/1027

and this one-liner documentation change: https://github.com/ArchiveBox/ArchiveBox/pull/1023

Good luck with the move!

1 reaction
turian commented, Sep 12, 2022

I believe I fixed this in https://github.com/ArchiveBox/ArchiveBox/pull/1026

TL;DR, until that’s merged:

Add this to ArchiveBox.conf:

YOUTUBEDL_BINARY=/usr/bin/yt-dlp

If that doesn’t work and you still get UnicodeDecodeErrors, you can use my Docker image turian/archivebox:kludge-984-UTF8-bug instead of archivebox/archivebox for now, or pip install from my branch.
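
To check which files in a snapshot are actually not valid UTF-8, something like the following could be run against the archive directory. This is a sketch under assumptions: `find_non_utf8` is a hypothetical name, and the default glob patterns are a guess at which text files the media extractor writes, so adjust them to what is on disk.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def find_non_utf8(
    root: Path,
    patterns: Iterable[str] = ("*.vtt", "*.srt", "*.description"),  # assumed file types
) -> List[Tuple[Path, int]]:
    """Return (path, byte offset) for each matching file that is not valid UTF-8.

    The offset is the position of the first undecodable byte, the same
    position that UnicodeDecodeError reports in its message.
    """
    bad: List[Tuple[Path, int]] = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            try:
                # Strict decode: raises on the first invalid byte sequence.
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError as err:
                bad.append((path, err.start))
    return bad
```

For the snapshot in the log above, this would be called as `find_non_utf8(Path("./archive/1653659773.131913"))`.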
