Support Schema.org's audio property, NY Times Audm audio recordings.

Checklist

I’m reporting a new site support request
I’ve verified that I’m running yt-dlp version 2022.02.04. (update instructions)
I’ve checked that all provided URLs are alive and playable in a browser
I’ve checked that none of provided URLs violate any copyrights or contain any DRM to the best of my knowledge
I’ve searched the bugtracker for similar issues including closed ones. DO NOT post duplicates
I’ve read the guidelines for opening an issue
I’ve read about sharing account credentials and am willing to share it if required

Region

United States

Example URLs

Single video: https://www.nytimes.com/2020/10/13/magazine/free-speech.html
Single video: https://www.nytimes.com/2022/02/15/magazine/anti-ambition-age.html
Single video: https://www.nytimes.com/2022/02/11/technology/airtags-gps-surveillance.html

More examples can be found with site:nytimes.com audm.

Description

The New York Times works with Audm in order to get narration for their long-form journalism. yt-dlp’s generic extractor (--force-generic) and dedicated NY Times extractor are unable to snag these readings. Luckily, this looks like an issue that would be easy to overcome.

Let’s look at a page’s script[type="application/ld+json"]:

{
	"@context": "http://schema.org",
	"@type": "NewsArticle",
	"description": "When 25 million people leave their jobs, it’s about more than just burnout.",
	"image": [
		{
			"@context": "http://schema.org",
			"@type": "ImageObject",
			"url": "https://static01.nyt.com/images/2022/02/20/magazine/20mag-intro-03/20mag-intro-03-videoSixteenByNineJumbo1600.jpg",
			"height": 900,
			"width": 1600
		},
		{
			"@context": "http://schema.org",
			"@type": "ImageObject",
			"url": "https://static01.nyt.com/images/2022/02/20/magazine/20mag-intro-03/20mag-intro-03-superJumbo.jpg",
			"height": 1536,
			"width": 2048
		},
		{
			"@context": "http://schema.org",
			"@type": "ImageObject",
			"url": "https://static01.nyt.com/images/2022/02/20/magazine/20mag-intro-03/20mag-intro-03-mediumSquareAt3X.jpg",
			"height": 1800,
			"width": 1800
		}
	],
	"mainEntityOfPage": "https://www.nytimes.com/2022/02/15/magazine/anti-ambition-age.html",
	"url": "https://www.nytimes.com/2022/02/15/magazine/anti-ambition-age.html",
	"inLanguage": "en",
	"author": [
		{
			"@context": "http://schema.org",
			"@type": "Person",
			"url": "",
			"name": "Noreen Malone"
		}
	],
	"dateModified": "2022-02-17T21:14:43.000Z",
	"datePublished": "2022-02-15T10:00:15.000Z",
	"headline": "The Age of Anti-Ambition",
	"audio": [
		{
			"@id": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
		}
	],
	"publisher": {
		"@id": "https://www.nytimes.com/#publisher"
	},
	"copyrightHolder": {
		"@id": "https://www.nytimes.com/#publisher"
	},
	"sourceOrganization": {
		"@id": "https://www.nytimes.com/#publisher"
	},
	"copyrightYear": 2022,
	"isAccessibleForFree": false,
	"hasPart": {
		"@type": "WebPageElement",
		"isAccessibleForFree": false,
		"cssSelector": ".meteredContent"
	},
	"isPartOf": {
		"@type": [
			"CreativeWork",
			"Product"
		],
		"name": "The New York Times",
		"productID": "nytimes.com:basic"
	}
}

The key of most interest to us is audio:

[
	{
		"@id": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
	}
]

That’s the audio we want!

Schema.org documents the audio property on their site. We should add support for it to yt-dlp’s JSON-LD utilities to improve generic extraction, and then use it in the NY Times extractor, which already understands the necessary metadata.
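
As a rough illustration of the idea, the following standalone sketch pulls the audio URLs out of a page's JSON-LD using only the Python standard library. The names JSON_LD_RE, iter_json_ld, and nyt_audio_urls are hypothetical and are not part of yt-dlp; a real patch would presumably go through yt-dlp's own JSON-LD utilities instead.

```python
import json
import re

# Hypothetical helpers for illustration only; this is not yt-dlp's internal API.
JSON_LD_RE = re.compile(
    r'<script[^>]+type=(["\'])application/ld\+json\1[^>]*>(?P<json>.+?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def iter_json_ld(html):
    """Yield each JSON-LD object embedded in the page, skipping invalid blocks."""
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group('json'))
        except json.JSONDecodeError:
            continue
        # A script tag may hold a single object or a list of objects.
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict):
                yield item


def nyt_audio_urls(html):
    """Collect the '@id' URLs found under the 'audio' key of a NewsArticle."""
    urls = []
    for item in iter_json_ld(html):
        if item.get('@type') != 'NewsArticle':
            continue
        for audio in item.get('audio') or []:
            url = audio.get('@id') if isinstance(audio, dict) else None
            if url:
                urls.append(url)
    return urls
```

Given the HTML of one of the example articles, nyt_audio_urls(html) would be expected to return a list holding the .mp3 URL shown above, assuming the page still embeds its JSON-LD in this shape.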

Verbose log

[debug] Command-line config: ['-vU', 'https://www.nytimes.com/2020/10/13/magazine/free-speech.html']
[debug] User config "/home/kwilliams/.config/yt-dlp/config": ['--netrc', '--sub-lang', 'en,en-US,eng', '--format', '(bestvideo+bestaudio/best)[format_id*=en]/(bestvideo+bestaudio/best)']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, err utf-8, pref UTF-8
[debug] yt-dlp version 2022.02.04 [c1653e9ef]
[debug] Python version 3.10.0 (CPython 64bit) - Linux-5.4.0-99-generic-x86_64-with-glibc2.31
[debug] exe versions: ffmpeg 4.2.4, ffprobe 4.2.4, rtmpdump 2.4
[debug] Optional libraries: Cryptodome, mutagen, sqlite, websockets
[debug] Proxy map: {}
Latest version: 2022.02.04, Current version: 2022.02.04
yt-dlp is up to date (2022.02.04)
[debug] [NYTimesArticle] Extracting URL: https://www.nytimes.com/2020/10/13/magazine/free-speech.html
[NYTimesArticle] free-speech: Downloading webpage
ERROR: [NYTimesArticle] free-speech: Unable to extract podcast data; please report this issue on  https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U; please report this issue on  https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U
  File "/home/kwilliams/.asdf/installs/yt-dlp/2022.02.04/venv/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 612, in extract
    ie_result = self._real_extract(url)
  File "/home/kwilliams/.asdf/installs/yt-dlp/2022.02.04/venv/lib/python3.10/site-packages/yt_dlp/extractor/nytimes.py", line 223, in _real_extract
    podcast_data = self._search_regex(
  File "/home/kwilliams/.asdf/installs/yt-dlp/2022.02.04/venv/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 1186, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)

Top GitHub Comments

pukkandan commented, Feb 19, 2022

	"audio": [
		{
			"@id": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
		}
	],

Correct me if I’m wrong, but while audio is a part of the schema, this does not adhere to the specified format. So this will need to be implemented in the NYTimes extractor rather than in common.

SuperSonicHub1 commented, Feb 19, 2022

Where does it say that the value of this audio item can be a list of objects with ‘@id’ having a value that is an audio media URL?

You are right about that. I wonder why they don’t bother following the spec; it would look something like this:

{
  "@type": "AudioObject",
  "contentUrl": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
}

I guess it will need to be NY Times-specific. That’s a shame.
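
For anyone picking this up, here is a minimal sketch of how a site-specific fallback might tolerate both shapes discussed in this thread: the form Schema.org describes (an AudioObject whose URL sits in the property it spells contentUrl) and the NY Times form (a bare object whose @id is the media URL). The function name audio_urls is hypothetical, and treating @id as a media URL is an assumption that only holds for sites that publish it that way.

```python
def audio_urls(audio):
    """Return media URLs from a JSON-LD 'audio' value (single object or list)."""
    # JSON-LD permits either one object or a list of them; handle both.
    entries = audio if isinstance(audio, list) else [audio]
    urls = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        # Prefer the documented property; fall back to the NY Times '@id' form.
        url = entry.get('contentUrl') or entry.get('@id')
        # '@id' is often a fragment such as '#publisher', so require a real URL.
        if isinstance(url, str) and url.startswith(('http://', 'https://')):
            urls.append(url)
    return urls
```

Checking for an http(s) prefix is a deliberately conservative choice here, since an @id is an identifier and is not guaranteed to point at a downloadable file.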
