Support Schema.org's audio property, NY Times Audm audio recordings.
Checklist
- I’m reporting a new site support request
- I’ve verified that I’m running yt-dlp version 2022.02.04. (update instructions)
- I’ve checked that all provided URLs are alive and playable in a browser
- I’ve checked that none of provided URLs violate any copyrights or contain any DRM to the best of my knowledge
- I’ve searched the bugtracker for similar issues including closed ones. DO NOT post duplicates
- I’ve read the guidelines for opening an issue
- I’ve read about sharing account credentials and am willing to share it if required
Region
United States
Example URLs
- Single video: https://www.nytimes.com/2020/10/13/magazine/free-speech.html
- Single video: https://www.nytimes.com/2022/02/15/magazine/anti-ambition-age.html
- Single video: https://www.nytimes.com/2022/02/11/technology/airtags-gps-surveillance.html
More examples can be found with site:nytimes.com audm.
Description
The New York Times works with Audm in order to get narration for their long-form journalism. yt-dlp’s generic extractor (--force-generic) and dedicated NY Times extractor are unable to snag these readings. Luckily, this looks like an issue that would be easy to overcome.
Let’s look at a page’s script[type="application/ld+json"]:
{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "description": "When 25 million people leave their jobs, it’s about more than just burnout.",
  "image": [
    {
      "@context": "http://schema.org",
      "@type": "ImageObject",
      "url": "https://static01.nyt.com/images/2022/02/20/magazine/20mag-intro-03/20mag-intro-03-videoSixteenByNineJumbo1600.jpg",
      "height": 900,
      "width": 1600
    },
    {
      "@context": "http://schema.org",
      "@type": "ImageObject",
      "url": "https://static01.nyt.com/images/2022/02/20/magazine/20mag-intro-03/20mag-intro-03-superJumbo.jpg",
      "height": 1536,
      "width": 2048
    },
    {
      "@context": "http://schema.org",
      "@type": "ImageObject",
      "url": "https://static01.nyt.com/images/2022/02/20/magazine/20mag-intro-03/20mag-intro-03-mediumSquareAt3X.jpg",
      "height": 1800,
      "width": 1800
    }
  ],
  "mainEntityOfPage": "https://www.nytimes.com/2022/02/15/magazine/anti-ambition-age.html",
  "url": "https://www.nytimes.com/2022/02/15/magazine/anti-ambition-age.html",
  "inLanguage": "en",
  "author": [
    {
      "@context": "http://schema.org",
      "@type": "Person",
      "url": "",
      "name": "Noreen Malone"
    }
  ],
  "dateModified": "2022-02-17T21:14:43.000Z",
  "datePublished": "2022-02-15T10:00:15.000Z",
  "headline": "The Age of Anti-Ambition",
  "audio": [
    {
      "@id": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
    }
  ],
  "publisher": {
    "@id": "https://www.nytimes.com/#publisher"
  },
  "copyrightHolder": {
    "@id": "https://www.nytimes.com/#publisher"
  },
  "sourceOrganization": {
    "@id": "https://www.nytimes.com/#publisher"
  },
  "copyrightYear": 2022,
  "isAccessibleForFree": false,
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": false,
    "cssSelector": ".meteredContent"
  },
  "isPartOf": {
    "@type": [
      "CreativeWork",
      "Product"
    ],
    "name": "The New York Times",
    "productID": "nytimes.com:basic"
  }
}
The key of most interest to us is audio:
[
  {
    "@id": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
  }
]
Schema.org documents the audio property on their site. We should add this to yt-dlp’s JSON-LD utilities so that generic extraction benefits too, and then use it in the NY Times extractor, which already understands the necessary metadata.
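To make the idea concrete, here is a minimal standalone sketch of pulling the audio URLs out of a page's JSON-LD. It is not yt-dlp code and does not use yt-dlp's internal helpers; the function name find_audio_urls and the regex are hypothetical, and it assumes the media URL sits in "@id" as in the NY Times example above.

```python
import json
import re

# Matches <script type="application/ld+json">...</script> blocks.
# Hypothetical helper regex for this sketch, not part of yt-dlp.
JSON_LD_RE = re.compile(
    r'<script[^>]+type=(["\'])application/ld\+json\1[^>]*>(?P<body>.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def find_audio_urls(html):
    """Collect URLs from the `audio` key of every JSON-LD block in `html`."""
    urls = []
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group('body'))
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        # A block may hold a single object or a list of objects.
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            audio = item.get('audio') or []
            if not isinstance(audio, list):
                audio = [audio]
            for entry in audio:
                # Assumption: the media URL is stored under "@id",
                # as in the NY Times markup quoted in this issue.
                url = entry.get('@id') if isinstance(entry, dict) else None
                if url:
                    urls.append(url)
    return urls
```

Feeding it the HTML of one of the example articles should yield the .mp3 URL from the audio array, provided the page still embeds the JSON-LD in the same shape.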
Verbose log
[debug] Command-line config: ['-vU', 'https://www.nytimes.com/2020/10/13/magazine/free-speech.html']
[debug] User config "/home/kwilliams/.config/yt-dlp/config": ['--netrc', '--sub-lang', 'en,en-US,eng', '--format', '(bestvideo+bestaudio/best)[format_id*=en]/(bestvideo+bestaudio/best)']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, err utf-8, pref UTF-8
[debug] yt-dlp version 2022.02.04 [c1653e9ef]
[debug] Python version 3.10.0 (CPython 64bit) - Linux-5.4.0-99-generic-x86_64-with-glibc2.31
[debug] exe versions: ffmpeg 4.2.4, ffprobe 4.2.4, rtmpdump 2.4
[debug] Optional libraries: Cryptodome, mutagen, sqlite, websockets
[debug] Proxy map: {}
Latest version: 2022.02.04, Current version: 2022.02.04
yt-dlp is up to date (2022.02.04)
[debug] [NYTimesArticle] Extracting URL: https://www.nytimes.com/2020/10/13/magazine/free-speech.html
[NYTimesArticle] free-speech: Downloading webpage
ERROR: [NYTimesArticle] free-speech: Unable to extract podcast data; please report this issue on https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U; please report this issue on https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U
  File "/home/kwilliams/.asdf/installs/yt-dlp/2022.02.04/venv/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 612, in extract
    ie_result = self._real_extract(url)
  File "/home/kwilliams/.asdf/installs/yt-dlp/2022.02.04/venv/lib/python3.10/site-packages/yt_dlp/extractor/nytimes.py", line 223, in _real_extract
    podcast_data = self._search_regex(
  File "/home/kwilliams/.asdf/installs/yt-dlp/2022.02.04/venv/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 1186, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
Issue Analytics
- State:
- Created 2 years ago
- Comments:7 (7 by maintainers)
Correct me if I’m wrong, but while audio is a part of the schema, this does not adhere to the specified format. So this will need to be implemented in the NYTimes extractor rather than in common.
You are right about that. I wonder why they don’t bother following the spec; it would look something like this:
I guess it will need to be NY Times-specific. That’s a shame.
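For reference, Schema.org lists AudioObject among the expected types for audio, and AudioObject carries the media location in contentUrl. The sketch below is one guess at how NY Times-specific handling could normalise their "@id"-only entries into that shape. It is not the snippet from the comment above and not yt-dlp's implementation; to_audio_object is a hypothetical name.

```python
def to_audio_object(entry):
    """Normalise one `audio` entry to a Schema.org-style AudioObject dict.

    Assumption: a spec-conformant entry carries its media URL in
    `contentUrl`, while the NY Times entry carries it in `@id`.
    Returns None when no URL can be found.
    """
    if not isinstance(entry, dict):
        return None
    # Prefer the documented property, fall back to the NY Times form.
    url = entry.get('contentUrl') or entry.get('@id')
    if not url:
        return None
    return {'@type': 'AudioObject', 'contentUrl': url}
```

Keeping the fallback in one small function would let the site-specific code accept both shapes if the NY Times ever switches to the documented form.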