Support Schema.org's audio property, NY Times Audm audio recordings.

Checklist

Region

United States

Example URLs

More examples can be found with site:nytimes.com audm.

Description

The New York Times works with Audm in order to get narration for their long-form journalism. yt-dlp’s generic extractor (--force-generic) and dedicated NY Times extractor are unable to snag these readings. Luckily, this looks like an issue that would be easy to overcome.

Let’s look at a page’s script[type="application/ld+json"]:

{
	"@context": "http://schema.org",
	"@type": "NewsArticle",
	"description": "When 25 million people leave their jobs, it’s about more than just burnout.",
	"image": [
		{
			"@context": "http://schema.org",
			"@type": "ImageObject",
			"url": "https://static01.nyt.com/images/2022/02/20/magazine/20mag-intro-03/20mag-intro-03-videoSixteenByNineJumbo1600.jpg",
			"height": 900,
			"width": 1600
		},
		{
			"@context": "http://schema.org",
			"@type": "ImageObject",
			"url": "https://static01.nyt.com/images/2022/02/20/magazine/20mag-intro-03/20mag-intro-03-superJumbo.jpg",
			"height": 1536,
			"width": 2048
		},
		{
			"@context": "http://schema.org",
			"@type": "ImageObject",
			"url": "https://static01.nyt.com/images/2022/02/20/magazine/20mag-intro-03/20mag-intro-03-mediumSquareAt3X.jpg",
			"height": 1800,
			"width": 1800
		}
	],
	"mainEntityOfPage": "https://www.nytimes.com/2022/02/15/magazine/anti-ambition-age.html",
	"url": "https://www.nytimes.com/2022/02/15/magazine/anti-ambition-age.html",
	"inLanguage": "en",
	"author": [
		{
			"@context": "http://schema.org",
			"@type": "Person",
			"url": "",
			"name": "Noreen Malone"
		}
	],
	"dateModified": "2022-02-17T21:14:43.000Z",
	"datePublished": "2022-02-15T10:00:15.000Z",
	"headline": "The Age of Anti-Ambition",
	"audio": [
		{
			"@id": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
		}
	],
	"publisher": {
		"@id": "https://www.nytimes.com/#publisher"
	},
	"copyrightHolder": {
		"@id": "https://www.nytimes.com/#publisher"
	},
	"sourceOrganization": {
		"@id": "https://www.nytimes.com/#publisher"
	},
	"copyrightYear": 2022,
	"isAccessibleForFree": false,
	"hasPart": {
		"@type": "WebPageElement",
		"isAccessibleForFree": false,
		"cssSelector": ".meteredContent"
	},
	"isPartOf": {
		"@type": [
			"CreativeWork",
			"Product"
		],
		"name": "The New York Times",
		"productID": "nytimes.com:basic"
	}
}

The key of most interest to us is audio:

[
	{
		"@id": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
	}
]

That’s the audio we want!

Schema.org documents the audio property on its site. We should add support for it to yt-dlp’s JSON-LD utilities, which would improve generic extraction, and then use it in the NY Times extractor, which already understands the necessary metadata.
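For illustration, here is a minimal standalone sketch of the lookup described above. It is not yt-dlp's actual JSON-LD code: the function name find_jsonld_audio and the regex are hypothetical, and it only collects the raw values of the audio key without deciding how to turn them into media URLs.

```python
import json
import re

# Matches each <script type="application/ld+json"> element and captures its body.
JSON_LD_RE = re.compile(
    r'<script[^>]+type=(["\'])application/ld\+json\1[^>]*>(?P<body>.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def find_jsonld_audio(html):
    """Return the raw `audio` entries of every JSON-LD object found in `html`.

    Hypothetical helper for illustration; not part of yt-dlp.
    """
    entries = []
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group('body'))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        # A block may hold a single object or a list of objects.
        for obj in data if isinstance(data, list) else [data]:
            if not isinstance(obj, dict):
                continue
            audio = obj.get('audio', [])
            # `audio` may be a single value or a list of values.
            entries.extend(audio if isinstance(audio, list) else [audio])
    return entries


sample = '''<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "The Age of Anti-Ambition",
 "audio": [{"@id": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"}]}
</script>'''

print(find_jsonld_audio(sample))
```

The sketch deliberately returns the entries untouched, since how to interpret an entry that carries only an @id is a separate decision.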

Verbose log

[debug] Command-line config: ['-vU', 'https://www.nytimes.com/2020/10/13/magazine/free-speech.html']
[debug] User config "/home/kwilliams/.config/yt-dlp/config": ['--netrc', '--sub-lang', 'en,en-US,eng', '--format', '(bestvideo+bestaudio/best)[format_id*=en]/(bestvideo+bestaudio/best)']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, err utf-8, pref UTF-8
[debug] yt-dlp version 2022.02.04 [c1653e9ef]
[debug] Python version 3.10.0 (CPython 64bit) - Linux-5.4.0-99-generic-x86_64-with-glibc2.31
[debug] exe versions: ffmpeg 4.2.4, ffprobe 4.2.4, rtmpdump 2.4
[debug] Optional libraries: Cryptodome, mutagen, sqlite, websockets
[debug] Proxy map: {}
Latest version: 2022.02.04, Current version: 2022.02.04
yt-dlp is up to date (2022.02.04)
[debug] [NYTimesArticle] Extracting URL: https://www.nytimes.com/2020/10/13/magazine/free-speech.html
[NYTimesArticle] free-speech: Downloading webpage
ERROR: [NYTimesArticle] free-speech: Unable to extract podcast data; please report this issue on  https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U; please report this issue on  https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U
  File "/home/kwilliams/.asdf/installs/yt-dlp/2022.02.04/venv/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 612, in extract
    ie_result = self._real_extract(url)
  File "/home/kwilliams/.asdf/installs/yt-dlp/2022.02.04/venv/lib/python3.10/site-packages/yt_dlp/extractor/nytimes.py", line 223, in _real_extract
    podcast_data = self._search_regex(
  File "/home/kwilliams/.asdf/installs/yt-dlp/2022.02.04/venv/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 1186, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

pukkandan commented, Feb 19, 2022 (1 reaction)
	"audio": [
		{
			"@id": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
		}
	],

Correct me if I’m wrong, but while audio is a part of the schema, this does not adhere to the specified format. So this will need to be implemented in the NYTimes extractor rather than in common.

SuperSonicHub1 commented, Feb 19, 2022 (0 reactions)

Where does it say that the value of this audio item can be a list of objects with ‘@id’ having a value that is an audio media URL?

You are right about that. I wonder why they don’t bother following the spec; it would look something like this:

{
  "contentUrl": "https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3"
}

I guess it will need to be NY Times-specific. That’s a shame.
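To make the difference between the two shapes concrete, here is a rough sketch of a site-specific fallback. The helper name audio_url is hypothetical and this is not code from yt-dlp. It assumes that a spec-style entry exposes its media URL under Schema.org's contentUrl (or url) property, and that an entry carrying only an @id holds the media URL itself, which is the NY Times behavior seen above rather than anything the spec promises.

```python
def audio_url(entry):
    """Return a media URL from one JSON-LD `audio` entry, or None.

    Hypothetical helper for illustration; not part of yt-dlp.
    """
    if not isinstance(entry, dict):
        return None
    # Spec-style object: the media URL lives in contentUrl (or url).
    for key in ('contentUrl', 'url'):
        value = entry.get(key)
        if isinstance(value, str) and value:
            return value
    # NY Times-style fallback (assumption): '@id' is the media URL itself.
    node_id = entry.get('@id')
    if isinstance(node_id, str) and node_id.startswith(('http://', 'https://')):
        return node_id
    return None


spec_style = {'@type': 'AudioObject', 'contentUrl': 'https://example.com/narration.mp3'}
nyt_style = {'@id': 'https://static.nytimes.com/podcasts/2022/02/15/magazine/15audm-where-we-go-malone/220215-where-we-go-malone-nytmag-audm.mp3'}

print(audio_url(spec_style))
print(audio_url(nyt_style))
```

The @id branch is only a heuristic. The same page uses "@id": "https://www.nytimes.com/#publisher" for publisher, which is a node identifier and not a media file, so a fallback like this is safer applied only to audio entries inside the NY Times extractor than in shared code.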
