[rts.ch] Unable to extract internal video id

Checklist

I’m reporting a broken site
I’ve verified that I’m running yt-dlp version 2022.02.04. (update instructions)
I’ve checked that all provided URLs are alive and playable in a browser
I’ve checked that all URLs and arguments with special characters are properly quoted or escaped
I’ve searched the bugtracker for similar issues including closed ones. DO NOT post duplicates
I’ve read the guidelines for opening an issue
I’ve read about sharing account credentials and I’m willing to share it if required

Region

Anywhere

Description

Looks like some regex issue. Test link: https://www.rts.ch/info/regions/valais/12865814-un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs.html

Verbose log

yt-dlp -v -F https://www.rts.ch/info/regions/valais/12865814-un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs.html
[debug] Command-line config: ['-v', '-F', 'https://www.rts.ch/info/regions/valais/12865814-un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs.html']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, err utf-8, pref UTF-8
[debug] yt-dlp version 2022.02.03 [28469edd7] (zip)
[debug] Plugins: ['SamplePluginIE', 'SamplePluginPP']
[debug] Python version 3.9.10 (CPython 64bit) - macOS-11.6.3-arm64-arm-64bit
[debug] exe versions: none
[debug] Optional libraries: sqlite
[debug] Proxy map: {}
[debug] [RTS] Extracting URL: https://www.rts.ch/info/regions/valais/12865814-un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs.html
[RTS] un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs: Downloading JSON metadata
[RTS] un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs: Downloading webpage
ERROR: [RTS] 12865814: Unable to extract internal video id; please report this issue on  https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U; please report this issue on  https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U
  File "/Users/zig/Downloads/DrB/yt-dlp/./yt-dlp/yt_dlp/extractor/common.py", line 615, in extract
    ie_result = self._real_extract(url)
  File "/Users/zig/Downloads/DrB/yt-dlp/./yt-dlp/yt_dlp/extractor/rts.py", line 159, in _real_extract
    internal_id = self._html_search_regex(
  File "/Users/zig/Downloads/DrB/yt-dlp/./yt-dlp/yt_dlp/extractor/common.py", line 1198, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
  File "/Users/zig/Downloads/DrB/yt-dlp/./yt-dlp/yt_dlp/extractor/common.py", line 1189, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)

Issue Analytics

State:
Created 2 years ago
Comments:8 (7 by maintainers)

Top GitHub Comments

pukkandan commented, Oct 18, 2022

The issue with the test framework should ideally be tracked in a separate issue. For #5275, it’s sufficient to just add skip_download

dirkf commented, Feb 15, 2022

In addition to this, many of the extractor tests fail.

The regex error is just the start. The pattern has to be adjusted to find things like data-media-urn="urn:rts:video:nnnnnnnn" to get the numeric id nnnnnnnn.
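As a rough illustration of the adjusted pattern, the sketch below runs the regex from the patch later in this comment against two made-up page fragments. The helper name find_internal_id and both HTML snippets are assumptions for demonstration only; just the pattern itself comes from the patch.

```python
import re

# Pattern from the patch in this thread: it accepts either the older
# <video data-id="..."> markup or the data-media-urn attribute.
INTERNAL_ID_RE = (
    r'(?:<(?:video|audio)\s+data-id\s*=\s*"'
    r'|data-media-urn\s*=\s*"urn:rts:(?:video|audio):)(\d+)"'
)


def find_internal_id(page):
    """Return the numeric internal id found in the page, or None."""
    match = re.search(INTERNAL_ID_RE, page)
    return match.group(1) if match else None


# Hypothetical page fragments, not copied from rts.ch.
new_markup = '<div class="player" data-media-urn="urn:rts:video:12861415"></div>'
old_markup = '<video data-id="5745975" controls></video>'

print(find_internal_id(new_markup))
print(find_internal_id(old_markup))
```

Keeping both alternatives in one pattern means pages that still carry the old data-id markup, if any remain, would continue to match.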

The extractor fetches three sources of metadata as well as the page itself:

1. 'http://www.rts.ch/a/%s.html?f=json/article' % item_id, where item_id is the ID extracted from the URL (in this case 12865814);
2. the same API, with the internal_id extracted from the webpage as item_id (in this case it should be 12861415, once the pattern is modified);
3. through the _get_media_data() method of the parent SRGSSRIE extractor, 'https://il.srgssr.ch/integrationlayer/2.0/%s/mediaComposition/%s/%s.json' % ('rts', 'video', item_id), whose result is discarded.
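The URL templates listed above can be expanded as in this small sketch. The two function names are hypothetical helpers, not part of yt-dlp; the templates and the example IDs are the ones quoted in this comment.

```python
# URL templates quoted in the comment above.
ARTICLE_API = 'http://www.rts.ch/a/%s.html?f=json/article'
IL_API = 'https://il.srgssr.ch/integrationlayer/2.0/%s/mediaComposition/%s/%s.json'


def article_json_url(item_id):
    """Article JSON URL for either the URL id or the internal id."""
    return ARTICLE_API % item_id


def media_composition_url(bu, media_type, item_id):
    """Integration-layer URL as built by the unpatched _get_media_data()."""
    return IL_API % (bu, media_type, item_id)


print(article_json_url('12865814'))   # ID taken from the page URL
print(article_json_url('12861415'))   # internal id found in the webpage
print(media_composition_url('rts', 'video', '12861415'))
```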

JSON #1 is valid but contains no media links.

JSON #2 is valid and contains media links at .video.JSONinfo.streams as a dict of format_id:url_path, with a base URL at .video.JSONinfo.download. But the URLs constructed by joining each url_path to the base fail because the domain of the base URL isn’t valid.
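To make that layout concrete, here is a minimal sketch of the join step using the standard library's urllib.parse.urljoin rather than yt-dlp's own helper. The function name stream_urls, the sample payload, and the example.invalid host are all assumptions; only the key paths (.video.JSONinfo.streams and .video.JSONinfo.download) come from the description above.

```python
from urllib.parse import urljoin


def stream_urls(all_info):
    """Join each stream path to the advertised download base."""
    info = (all_info.get('video') or {}).get('JSONinfo') or {}
    base = info.get('download') or ''
    streams = info.get('streams') or {}
    return {format_id: urljoin(base, path) for format_id, path in streams.items()}


# Hypothetical payload shaped like the description; values are made up.
sample = {
    'video': {
        'JSONinfo': {
            'download': 'http://media.example.invalid/',
            'streams': {'hls': 'hls/clip/master.m3u8'},
        },
    },
}

print(stream_urls(sample))
```

The join itself succeeds; as described above, the failure only shows up when the resulting URL is requested, because the base host does not resolve.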

The discarded JSON #3 has a media link for the entire show (not just the target clip) at .chapterList[0].resourceList, a list of dicts with key url. It might be possible to construct the clip URL by finding the segment by clip ID in .chapterList[0].segmentList and adding the markIn and markOut query parameters to the URL, but the segment list isn’t returned with the onlyChapters=true query parameter that _get_media_data() uses for videos.

A browser session doesn’t use either of the first two API calls. It uses a different endpoint of the third, 'https://il.srgssr.ch/integrationlayer/2.0/mediaComposition/byUrn/urn:%s:%s:%s.json' % ('rts', 'video', item_id). The JSON resulting from this has a media link at .chapterList[i].resourceList[0], where i is such that .chapterList[i].id is the clip ID, a dict with key url. This link can be passed to the existing code and finds the target clip.
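The lookup described in the previous paragraph could look roughly like this. The helper names by_urn_url and clip_resource, the sample document, and the example.invalid URLs are assumptions for illustration; the endpoint template and the chapterList / id / resourceList / url keys are the ones named above.

```python
BY_URN_API = (
    'https://il.srgssr.ch/integrationlayer/2.0/'
    'mediaComposition/byUrn/urn:%s:%s:%s.json'
)


def by_urn_url(bu, media_type, item_id):
    """Build the byUrn endpoint URL that the browser session uses."""
    return BY_URN_API % (bu, media_type, item_id)


def clip_resource(media_composition, clip_id):
    """Return resourceList[0] of the chapter whose id equals clip_id."""
    for chapter in media_composition.get('chapterList') or []:
        if chapter.get('id') == clip_id:
            resources = chapter.get('resourceList') or []
            return resources[0] if resources else None
    return None


# Hypothetical response: one chapter for the whole show, one for the clip.
sample = {
    'chapterList': [
        {'id': '111', 'resourceList': [{'url': 'https://example.invalid/show.m3u8'}]},
        {'id': '12861415', 'resourceList': [{'url': 'https://example.invalid/clip.m3u8'}]},
    ],
}

print(by_urn_url('rts', 'video', '12861415'))
print(clip_resource(sample, '12861415'))
```

Matching on the chapter id rather than taking chapterList[0] is what separates the target clip from the full show.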

So this patch gets the clip, but it would need more testing by users in CH, as the base extractor seems to handle a lot of sources that aren’t covered by tests:

--- old/yt-dlp/yt_dlp/extractor/srgssr.py
+++ new/yt-dlp/yt_dlp/extractor/srgssr.py
@@ -61,7 +61,7 @@
     def _get_media_data(self, bu, media_type, media_id):
         query = {'onlyChapters': True} if media_type == 'video' else {}
         full_media_data = self._download_json(
-            'https://il.srgssr.ch/integrationlayer/2.0/%s/mediaComposition/%s/%s.json'
+            'https://il.srgssr.ch/integrationlayer/2.0/mediaComposition/byUrn/urn:%s:%s:%s.json'
             % (bu, media_type, media_id),
             media_id, query=query)['chapterList']
         try:
@@ -191,6 +191,7 @@
             'title': 'Saira: Tujetsch - tuttina cuntinuar cun Sedrun Mustér Turissem',
             'timestamp': 1444709160,
             'duration': 336.816,
+            'thumbnail': r're:https?://ws\.srf\.ch/asset/image/.+/\d+\.jpg',
         },
         'params': {
             # rtmp download

--- old/yt-dlp/yt_dlp/extractor/rts.py
+++ new/yt-dlp/yt_dlp/extractor/rts.py
@@ -10,6 +10,7 @@
     int_or_none,
     parse_duration,
     parse_iso8601,
+    try_get,
     unescapeHTML,
     urljoin,
 )
@@ -45,6 +46,7 @@
                 'title': 'Passe-moi les jumelles',
             },
             'playlist_mincount': 4,
+            'skip': '404 Page Not Found',
         },
         {
             'url': 'http://www.rts.ch/video/sport/hockey/5745975-1-2-kloten-fribourg-5-2-second-but-pour-gotteron-par-kwiatowski.html',
@@ -108,6 +110,7 @@
                 'title': 'Hockey: Davos décroche son 31e titre de champion de Suisse',
             },
             'playlist_mincount': 5,
+            'skip': 'Blocked outside Switzerland',
         },
         {
             'url': 'http://pages.rts.ch/emissions/passe-moi-les-jumelles/5624065-entre-ciel-et-mer.html',
@@ -157,14 +160,15 @@
                 return self.playlist_result(entries, media_id, all_info.get('title'))
 
             internal_id = self._html_search_regex(
-                r'<(?:video|audio) data-id="([0-9]+)"', page,
+                r'(?:<(?:video|audio)\s+data-id\s*=\s*"|data-media-urn\s*=\s*"urn:rts:(?:video|audio):)(\d+)"', page,
                 'internal video id')
             all_info = download_json(internal_id)
+            media_id = internal_id
 
         media_type = 'video' if 'video' in all_info else 'audio'
 
         # check for errors
-        self._get_media_data('rts', media_type, media_id)
+        media_info = self._get_media_data('rts', media_type, media_id)
 
         info = all_info['video']['JSONinfo'] if 'video' in all_info else all_info['audio']
 
@@ -175,8 +179,15 @@
                 r'-([0-9]+)k\.', url, 'bitrate', default=None))
 
         formats = []
+
+        def streams_from_media_data(m_data):
+            return dict(
+                (res.get('protocol', i), res['url'], )
+                for i, res in enumerate(try_get(m_data, lambda x: x['resourceList'], list) or [], 1)
+                if try_get(res, lambda x: x['url']))
+
         streams = info.get('streams', {})
-        for format_id, format_url in streams.items():
+        for format_id, format_url in (streams_from_media_data(media_info) or streams).items():
             if format_id == 'hds_sd' and 'hds' in streams:
                 continue
             if format_id == 'hls_sd' and 'hls' in streams:
@@ -198,14 +209,14 @@
                     'tbr': extract_bitrate(format_url),
                 })
 
-        download_base = 'http://rtsww%s-d.rts.ch/' % ('-a' if media_type == 'audio' else '')
+        download_base = info.get('download', 'http://rtsww%s-d.rts.ch/' % ('-a' if media_type == 'audio' else '', ))
         for media in info.get('media', []):
             media_url = media.get('url')
             if not media_url or re.match(r'https?://', media_url):
                 continue
             rate = media.get('rate')
             ext = media.get('ext') or determine_ext(media_url, 'mp4')
-            format_id = ext
+            format_id = (re.findall(r'_[A-Za-z\d]+\.', media_url) or ['_%s.' % (ext, )])[-1][1:-1]
             if rate:
                 format_id += '-%dk' % rate
             formats.append({
