Bug: Indexing subtitles in media extractor fails when they're not UTF-8 encoded
I get the following when archiving a link to a YouTube video:
[+] [2022-05-27 13:56:21] "youtu.be/_ZNCCttVMg8?t=1592"
https://youtu.be/_ZNCCttVMg8?t=1592
> ./archive/1653659773.131913
> title
> favicon
> headers
> singlefile
> screenshot
> wget
> readability
> mercury
> media
Traceback (most recent call last):
File "/app/archivebox/extractors/__init__.py", line 109, in archive_link
! Failed to archive link: Exception: Exception in archive_methods.save_media(Link(url=https://youtu.be/_ZNCCttVMg8?t=1592))
result = method_function(link=link, out_dir=out_dir)
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/extractors/media.py", line 75, in save_media
index_texts = [
File "/app/archivebox/extractors/media.py", line 76, in <listcomp>
text_file.read_text(encoding='utf-8').strip()
File "/usr/local/lib/python3.10/pathlib.py", line 1133, in read_text
return f.read()
File "/usr/local/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 87545: invalid start byte
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/archivebox", line 33, in <module>
sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
File "/app/archivebox/cli/__init__.py", line 140, in main
run_subcommand(
File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
module.main(args=subcommand_args, stdin=stdin, pwd=pwd) # type: ignore
File "/app/archivebox/cli/archivebox_add.py", line 103, in main
add(
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/main.py", line 626, in add
archive_links(new_links, overwrite=False, **archive_kwargs)
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/extractors/__init__.py", line 181, in archive_links
archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
File "/app/archivebox/util.py", line 114, in typechecked_function
return func(*args, **kwargs)
File "/app/archivebox/extractors/__init__.py", line 130, in archive_link
raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_media(Link(url=https://youtu.be/_ZNCCttVMg8?t=1592))
When this happens it stops processing the rest of the URLs I provided.
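The behaviour described above, where one failing link aborts the whole batch, can be illustrated with a minimal sketch. This is not ArchiveBox's code: `archive_all` and `fake_archive` are hypothetical names used only for illustration, and the sketch simply shows how catching the exception per URL would let the remaining URLs continue.

```python
from typing import Callable, Iterable, List, Tuple


def archive_all(
    urls: Iterable[str],
    archive_one: Callable[[str], None],
) -> List[Tuple[str, Exception]]:
    """Run archive_one on every URL, collecting failures instead of aborting."""
    failures: List[Tuple[str, Exception]] = []
    for url in urls:
        try:
            archive_one(url)
        except Exception as exc:
            # Record the failure and keep going with the next URL.
            failures.append((url, exc))
    return failures


def fake_archive(url: str) -> None:
    """Hypothetical stand-in for an extractor: fails on one URL like the report."""
    if "bad" in url:
        # Decoding a lone 0x96 byte as UTF-8 raises UnicodeDecodeError.
        b"\x96".decode("utf-8")


failures = archive_all(
    ["https://example.com/ok", "https://example.com/bad", "https://example.com/ok2"],
    fake_archive,
)
for url, exc in failures:
    print(f"failed: {url}: {type(exc).__name__}")
```

Whether swallowing the error per link is the right policy is a design choice; the sketch only shows that the batch does not need to stop.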
ArchiveBox version
ArchiveBox v0.6.3
Cpython Linux Linux-5.4.0-113-generic-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic
[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.6.3 valid /usr/local/bin/archivebox
√ PYTHON_BINARY v3.10.4 valid /usr/local/bin/python3.10
√ DJANGO_BINARY v3.1.14 valid /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.74.0 valid /usr/bin/curl
√ WGET_BINARY v1.21 valid /usr/bin/wget
√ NODE_BINARY v17.9.0 valid /usr/bin/node
√ SINGLEFILE_BINARY v0.3.16 valid /node/node_modules/single-file/cli/single-file
√ READABILITY_BINARY v0.0.2 valid /node/node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid /node/node_modules/@postlight/mercury-parser/cli.js
- GIT_BINARY - disabled /usr/bin/git
√ YOUTUBEDL_BINARY v2022.04.08 valid /usr/local/bin/yt-dlp
√ CHROME_BINARY v100.0.4896.127 valid /usr/bin/chromium
√ RIPGREP_BINARY v12.1.1 valid /usr/bin/rg
[i] Source-code locations:
√ PACKAGE_DIR 24 files valid /app/archivebox
√ TEMPLATES_DIR 4 files valid /app/archivebox/templates
- CUSTOM_TEMPLATES_DIR - disabled
[i] Secrets locations:
- CHROME_USER_DATA_DIR - disabled
- COOKIES_FILE - disabled
[i] Data locations:
√ OUTPUT_DIR 7 files valid /data
√ SOURCES_DIR 4 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 35 files valid ./archive
√ CONFIG_FILE 81.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 672.0 KB valid ./index.sqlite3
Issue Analytics
- State:
- Created a year ago
- Comments:12 (10 by maintainers)
Top GitHub Comments
Ach damn 😦 I lived in the bay area. I feel your pain. Have you considered moving to Berlin?
Well as a Berliner you could apply for an EU grant. Somehow memex got one even tho they are for-profit now. It seems like a cool project but they refuse to implement bulk export. Their sponsors
If you ping me later I might have other ideas for sponsors.
Can I please make a tiny request?
As a new contributor, can you please just enable access so that GitHub Actions CI/CD will run on my PRs?
Besides my larger PRs on yt-dlp (which I know you are too busy to review since it requires some thought), I have this tiny one to fix everyone's migration complaint about dev: https://github.com/ArchiveBox/ArchiveBox/pull/1027 and this one-liner documentation change: https://github.com/ArchiveBox/ArchiveBox/pull/1023
Good luck with the move!
I believe I fixed this in https://github.com/ArchiveBox/ArchiveBox/pull/1026
TL;DR, until that's merged:
Add this to ArchiveBox.conf:
If that doesn't work and you still get crap UnicodeDecodeErrors, you can use my Docker image turian/archivebox:kludge-984-UTF8-bug instead of archivebox/archivebox for now. Or use my branch and pip install or whatever from there.
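For context, the traceback in the report fails at `text_file.read_text(encoding='utf-8')`. Below is a minimal sketch of a more tolerant read, assuming the subtitle file contains bytes that are not valid UTF-8. It is not necessarily what the PR above does; `read_text_lenient` and the sample file name are hypothetical. The sample uses Windows-1252 only because byte 0x96 is an en dash in that encoding; the actual encoding of the failing subtitle file is not stated in the report.

```python
import tempfile
from pathlib import Path


def read_text_lenient(path: Path) -> str:
    """Read a file as UTF-8, replacing undecodable bytes with U+FFFD."""
    return path.read_text(encoding="utf-8", errors="replace").strip()


with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "example.vtt"  # hypothetical subtitle file
    # An en dash encoded as Windows-1252 becomes the single byte 0x96.
    sub.write_bytes("caption \u2013 text".encode("cp1252"))

    try:
        sub.read_text(encoding="utf-8")  # strict decoding, as in the traceback
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    lenient = read_text_lenient(sub)

print(strict_failed)
print(repr(lenient))
```

Replacing bad bytes loses the original character, which is usually acceptable for search indexing but not for preserving the subtitle file itself.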