[rts.ch] Unable to extract internal video id

Region

Anywhere

Description

Looks like some regex issue. Test link: https://www.rts.ch/info/regions/valais/12865814-un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs.html

Verbose log

yt-dlp -v -F https://www.rts.ch/info/regions/valais/12865814-un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs.html
[debug] Command-line config: ['-v', '-F', 'https://www.rts.ch/info/regions/valais/12865814-un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs.html']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, err utf-8, pref UTF-8
[debug] yt-dlp version 2022.02.03 [28469edd7] (zip)
[debug] Plugins: ['SamplePluginIE', 'SamplePluginPP']
[debug] Python version 3.9.10 (CPython 64bit) - macOS-11.6.3-arm64-arm-64bit
[debug] exe versions: none
[debug] Optional libraries: sqlite
[debug] Proxy map: {}
[debug] [RTS] Extracting URL: https://www.rts.ch/info/regions/valais/12865814-un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs.html
[RTS] un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs: Downloading JSON metadata
[RTS] un-bouquetin-emporte-par-un-aigle-royal-sur-les-hauts-de-fully-vs: Downloading webpage
ERROR: [RTS] 12865814: Unable to extract internal video id; please report this issue on  https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U; please report this issue on  https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using -U
  File "/Users/zig/Downloads/DrB/yt-dlp/./yt-dlp/yt_dlp/extractor/common.py", line 615, in extract
    ie_result = self._real_extract(url)
  File "/Users/zig/Downloads/DrB/yt-dlp/./yt-dlp/yt_dlp/extractor/rts.py", line 159, in _real_extract
    internal_id = self._html_search_regex(
  File "/Users/zig/Downloads/DrB/yt-dlp/./yt-dlp/yt_dlp/extractor/common.py", line 1198, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
  File "/Users/zig/Downloads/DrB/yt-dlp/./yt-dlp/yt_dlp/extractor/common.py", line 1189, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
pukkandan commented, Oct 18, 2022

The issue with the test framework should ideally be tracked in a separate issue. For #5275, it’s sufficient to just add skip_download.

1 reaction
dirkf commented, Feb 15, 2022

In addition to this, many of the extractor tests fail.

The regex error is just the start. The pattern has to be adjusted to find things like data-media-urn="urn:rts:video:nnnnnnnn" to get the numeric id nnnnnnnn.
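
As a rough illustration of the adjusted pattern, here is a minimal standalone sketch using plain re rather than the extractor's _html_search_regex helper. The function name find_internal_id and the two HTML snippets are made up for the example; only the attribute shapes come from the discussion.

```python
import re

# Accepts both the old <video|audio data-id="..."> markup and the newer
# data-media-urn="urn:rts:video:NNN" attribute described above.
INTERNAL_ID_RE = re.compile(
    r'(?:<(?:video|audio)\s+data-id\s*=\s*"'
    r'|data-media-urn\s*=\s*"urn:rts:(?:video|audio):)(\d+)"')


def find_internal_id(page):
    """Return the numeric internal id found in the page HTML, or None."""
    match = INTERNAL_ID_RE.search(page)
    return match.group(1) if match else None


# Hypothetical page fragments, for illustration only.
old_markup = '<video data-id="5745975" controls>'
new_markup = '<div class="player" data-media-urn="urn:rts:video:12861415"></div>'

print(find_internal_id(old_markup))
print(find_internal_id(new_markup))
```

Keeping the old alternative in the pattern means pages that still use the data-id markup should continue to match.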

The extractor fetches three sources of metadata as well as the page itself:

  1. 'http://www.rts.ch/a/%s.html?f=json/article' % item_id where item_id is the ID extracted from the URL (in this case 12865814)
  2. the same API with the internal_id extracted from the webpage as item_id (in this case it should be 12861415, once the pattern is modified)
  3. through the _get_media_data() method of the parent SRGSSRIE extractor, 'https://il.srgssr.ch/integrationlayer/2.0/%s/mediaComposition/%s/%s.json' % ('rts', 'video', item_id), whose result is discarded.

JSON #1 is valid but contains no media links.

JSON #2 is valid and contains media links at .video.JSONinfo.streams as a dict of format_id:url_path, with a base URL at .video.JSONinfo.download. But the URLs constructed by joining each url_path to the base fail because the domain of the base URL isn’t valid.

The discarded JSON #3 has a media link for the entire show (not just the target clip) at .chapterList[0].resourceList, a list of dicts with key url. It might be possible to construct the clip URL by finding the segment by clip ID in .chapterList[0].segmentList and adding the markIn and markOut query parameters to the URL, but the segment list isn’t returned when the onlyChapters=true query parameter, which _get_media_data() uses for videos, is set.
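
The segment idea could look roughly like the sketch below. It is untested against the real API and rests on assumptions: that each segment dict carries id, markIn and markOut keys, and that the server honours those values as query parameters. The function name clip_url_from_show is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clip_url_from_show(chapter, clip_id):
    """Return the show URL narrowed to one segment, or None if not derivable."""
    # Assumption: each segment dict has 'id', 'markIn' and 'markOut' keys.
    segment = next(
        (s for s in chapter.get('segmentList') or []
         if str(s.get('id')) == str(clip_id)),
        None)
    resources = chapter.get('resourceList') or []
    show_url = resources[0].get('url') if resources else None
    if segment is None or not show_url:
        return None
    if 'markIn' not in segment or 'markOut' not in segment:
        return None
    parts = urlsplit(show_url)
    query = dict(parse_qsl(parts.query))  # keep any existing parameters
    query.update({'markIn': segment['markIn'], 'markOut': segment['markOut']})
    return urlunsplit(parts._replace(query=urlencode(query)))
```

This only applies if the request is made without onlyChapters=true, since otherwise segmentList is missing and the function returns None.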

A browser session doesn’t use either of the first two API calls. It uses a different endpoint of the third, 'https://il.srgssr.ch/integrationlayer/2.0/mediaComposition/byUrn/urn:%s:%s:%s.json' % ('rts', 'video', item_id). The resulting JSON has a media link at .chapterList[i].resourceList[0] (a dict with key url), where i is chosen so that .chapterList[i].id is the clip ID. This link can be passed to the existing code, which then finds the target clip.
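
To make the lookup concrete, here is a small sketch of building the byUrn URL and picking the clip's chapter out of an already-downloaded response. It makes no network calls. The helper names by_urn_url and clip_media_url are made up, and the response shape is assumed to be exactly as described above.

```python
BY_URN_TEMPLATE = (
    'https://il.srgssr.ch/integrationlayer/2.0/mediaComposition/byUrn/'
    'urn:%s:%s:%s.json')


def by_urn_url(bu, media_type, media_id):
    """Build the byUrn endpoint URL for a business unit, media type and id."""
    return BY_URN_TEMPLATE % (bu, media_type, media_id)


def clip_media_url(media_composition, clip_id):
    """Return resourceList[0]['url'] of the chapter whose id is clip_id."""
    for chapter in media_composition.get('chapterList') or []:
        if str(chapter.get('id')) == str(clip_id):
            resources = chapter.get('resourceList') or []
            return resources[0].get('url') if resources else None
    return None  # no chapter matched the clip id


print(by_urn_url('rts', 'video', '12861415'))
```

Comparing ids as strings is a defensive choice here, since the id taken from the webpage is a string while the JSON could hold either type.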

So this patch gets the clip, but it would need more testing by users in CH, as the base extractor seems to handle a lot of sources that aren’t covered by tests:

--- old/yt-dlp/yt_dlp/extractor/srgssr.py
+++ new/yt-dlp/yt_dlp/extractor/srgssr.py
@@ -61,7 +61,7 @@
     def _get_media_data(self, bu, media_type, media_id):
         query = {'onlyChapters': True} if media_type == 'video' else {}
         full_media_data = self._download_json(
-            'https://il.srgssr.ch/integrationlayer/2.0/%s/mediaComposition/%s/%s.json'
+            'https://il.srgssr.ch/integrationlayer/2.0/mediaComposition/byUrn/urn:%s:%s:%s.json'
             % (bu, media_type, media_id),
             media_id, query=query)['chapterList']
         try:
@@ -191,6 +191,7 @@
             'title': 'Saira: Tujetsch - tuttina cuntinuar cun Sedrun Mustér Turissem',
             'timestamp': 1444709160,
             'duration': 336.816,
+            'thumbnail': r're:https?://ws\.srf\.ch/asset/image/.+/\d+\.jpg',
         },
         'params': {
             # rtmp download

--- old/yt-dlp/yt_dlp/extractor/rts.py
+++ new/yt-dlp/yt_dlp/extractor/rts.py
@@ -10,6 +10,7 @@
     int_or_none,
     parse_duration,
     parse_iso8601,
+    try_get,
     unescapeHTML,
     urljoin,
 )
@@ -45,6 +46,7 @@
                 'title': 'Passe-moi les jumelles',
             },
             'playlist_mincount': 4,
+            'skip': '404 Page Not Found',
         },
         {
             'url': 'http://www.rts.ch/video/sport/hockey/5745975-1-2-kloten-fribourg-5-2-second-but-pour-gotteron-par-kwiatowski.html',
@@ -108,6 +110,7 @@
                 'title': 'Hockey: Davos décroche son 31e titre de champion de Suisse',
             },
             'playlist_mincount': 5,
+            'skip': 'Blocked outside Switzerland',
         },
         {
             'url': 'http://pages.rts.ch/emissions/passe-moi-les-jumelles/5624065-entre-ciel-et-mer.html',
@@ -157,14 +160,15 @@
                 return self.playlist_result(entries, media_id, all_info.get('title'))
 
             internal_id = self._html_search_regex(
-                r'<(?:video|audio) data-id="([0-9]+)"', page,
+                r'(?:<(?:video|audio)\s+data-id\s*=\s*"|data-media-urn\s*=\s*"urn:rts:(?:video|audio):)(\d+)"', page,
                 'internal video id')
             all_info = download_json(internal_id)
+            media_id = internal_id
 
         media_type = 'video' if 'video' in all_info else 'audio'
 
         # check for errors
-        self._get_media_data('rts', media_type, media_id)
+        media_info = self._get_media_data('rts', media_type, media_id)
 
         info = all_info['video']['JSONinfo'] if 'video' in all_info else all_info['audio']
 
@@ -175,8 +179,15 @@
                 r'-([0-9]+)k\.', url, 'bitrate', default=None))
 
         formats = []
+
+        def streams_from_media_data(m_data):
+            return dict(
+                (res.get('protocol', i), res['url'], )
+                for i, res in enumerate(try_get(m_data, lambda x: x['resourceList'], list), 1)
+                if try_get(res, lambda x: x['url']))
+
         streams = info.get('streams', {})
-        for format_id, format_url in streams.items():
+        for format_id, format_url in (streams_from_media_data(media_info) or streams).items():
             if format_id == 'hds_sd' and 'hds' in streams:
                 continue
             if format_id == 'hls_sd' and 'hls' in streams:
@@ -198,14 +209,14 @@
                     'tbr': extract_bitrate(format_url),
                 })
 
-        download_base = 'http://rtsww%s-d.rts.ch/' % ('-a' if media_type == 'audio' else '')
+        download_base = info.get('download', 'http://rtsww%s-d.rts.ch/' % ('-a' if media_type == 'audio' else '', ))
         for media in info.get('media', []):
             media_url = media.get('url')
             if not media_url or re.match(r'https?://', media_url):
                 continue
             rate = media.get('rate')
             ext = media.get('ext') or determine_ext(media_url, 'mp4')
-            format_id = ext
+            format_id = (re.findall(r'_[A-Za-z\d]+\.', media_url) or ('_%s.' % (ext, )))[-1][1:-1]
             if rate:
                 format_id += '-%dk' % rate
             formats.append({
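
One thing to watch in the streams_from_media_data() helper from the patch: try_get() returns None when resourceList is missing or is not a list, and enumerate(None) then raises TypeError. A standalone sketch of the same mapping with a guard, written without the yt-dlp utilities so it runs on its own (a suggestion only, not part of the patch):

```python
def streams_from_media_data(m_data):
    """Map each resource's protocol (or its 1-based index) to its URL."""
    resources = (m_data or {}).get('resourceList')
    if not isinstance(resources, list):
        return {}  # missing or malformed list: fall back to an empty dict
    return {
        res.get('protocol', i): res['url']
        for i, res in enumerate(resources, 1)
        if isinstance(res, dict) and res.get('url')
    }
```

Returning an empty dict keeps the `streams_from_media_data(media_info) or streams` fallback in the patch working when the media data has no resources.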