[YouTube] Tab Extractor may not get all pages for (very) large channels
Checklist
- I’m reporting a broken site support issue
- I’ve verified that I’m running youtube-dl version 2021.02.10
- I’ve checked that all provided URLs are alive and playable in a browser
- I’ve checked that all URLs and arguments with special characters are properly quoted or escaped
- I’ve searched the bugtracker for similar bug reports including closed ones
- I’ve read bugs section in FAQ
Verbose log
Test: channel https://www.youtube.com/user/TEDxTalks/videos with ~163,192 videos (as of writing)
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '--flat-playlist', 'https://www.youtube.com/user/TEDxTalks/videos']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.02.10
[debug] Python version 3.9.1 (CPython) - Linux-5.10.15-1-MANJARO-x86_64-with-glibc2.33
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[youtube:tab] TEDxTalks: Downloading webpage
[download] Downloading playlist: TEDx Talks - Videos
[youtube:tab] Downloading page 1
[youtube:tab] Downloading page 2
[youtube:tab] Downloading page 3
[youtube:tab] Downloading page 4
[youtube:tab] Downloading page 5
[youtube:tab] Downloading page 6
[youtube:tab] Downloading page 7
[youtube:tab] Downloading page 8
[youtube:tab] Downloading page 9
[youtube:tab] Downloading page 10
[youtube:tab] Downloading page 11
[...]
[youtube:tab] Downloading page 1674
[youtube:tab] Downloading page 1675
[youtube:tab] Downloading page 1676
[youtube:tab] Downloading page 1677
[youtube:tab] Downloading page 1678
[youtube:tab] playlist TEDx Talks - Videos: Downloading 50339 videos
[download] Downloading video 1 of 50339
[download] Downloading video 2 of 50339
[download] Downloading video 3 of 50339
[...]
[download] Downloading video 50338 of 50339
[download] Downloading video 50339 of 50339
[download] Finished downloading playlist: TEDx Talks - Videos
In this case it only gathered 50339 videos.
Running it again to show that this isn’t a fixed limit imposed by YouTube:
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '--flat-playlist', 'https://www.youtube.com/user/TEDxTalks/videos']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.02.10
[debug] Python version 3.9.1 (CPython) - Linux-5.10.15-1-MANJARO-x86_64-with-glibc2.33
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[youtube:tab] TEDxTalks: Downloading webpage
[download] Downloading playlist: TEDx Talks - Videos
[youtube:tab] Downloading page 1
[youtube:tab] Downloading page 2
[youtube:tab] Downloading page 3
[youtube:tab] Downloading page 4
[youtube:tab] Downloading page 5
[youtube:tab] Downloading page 6
[youtube:tab] Downloading page 7
[youtube:tab] Downloading page 8
[youtube:tab] Downloading page 9
[youtube:tab] Downloading page 10
[youtube:tab] Downloading page 11
[youtube:tab] Downloading page 12
[...]
[youtube:tab] Downloading page 2100
[youtube:tab] Downloading page 2101
[youtube:tab] Downloading page 2102
[youtube:tab] Downloading page 2103
[youtube:tab] Downloading page 2104
[youtube:tab] Downloading page 2105
[youtube:tab] Downloading page 2106
[youtube:tab] playlist TEDx Talks - Videos: Downloading 63179 videos
[download] Downloading video 1 of 63179
[download] Downloading video 2 of 63179
[download] Downloading video 3 of 63179
[download] Downloading video 4 of 63179
[download] Downloading video 5 of 63179
[download] Downloading video 6 of 63179
[download] Downloading video 7 of 63179
[download] Downloading video 8 of 63179
[...]
[download] Downloading video 63177 of 63179
[download] Downloading video 63178 of 63179
[download] Downloading video 63179 of 63179
[download] Finished downloading playlist: TEDx Talks - Videos
This time it gathered 63179 videos.
Description
I’ve done some investigating into what I think is causing this:
When downloading tab pages, the next page, fetched using the continuation token found in the previous page, may not contain any continuation items/contents (i.e. videos). This appears to be a server-side issue with YouTube (possibly a form of rate-limiting). The HTTP status for these pages is 200.
From my findings, simply retrying the page download with the same continuation token (sometimes more than once) will eventually(?) return a page with the continuation items.
I have found this mostly happens when you try to download channels with tens of thousands of videos.
This is an issue because, when there are no continuation items, youtube-dl breaks out of the page extraction loop. As a result, youtube-dl does not get all the videos on the channel (or all that YouTube provides) and incorrectly treats the run as if it had extracted everything (false success).
Part of the extractor I’m referring to for reference: https://github.com/ytdl-org/youtube-dl/blob/9fc5eafb8e384453a49f7cfe73147be491f0b19d/youtube_dl/extractor/youtube.py#L2483-L2553
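The retry behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not youtube-dl's actual code: `fetch_page`, `iter_entries` and `IncompletePageError` are hypothetical names, and `fetch_page(token)` is assumed to return a pair `(items, next_token)`. It also assumes that an empty page reached through a valid continuation token is never a legitimate end of the list, which has not been confirmed.

```python
import time


class IncompletePageError(Exception):
    """Hypothetical error: a continuation page stayed empty after all retries."""


def iter_entries(fetch_page, first_token, max_retries=3, delay=0.0):
    """Yield entries page by page, retrying pages that come back empty.

    fetch_page is a hypothetical callable: fetch_page(token) returns
    (items, next_token), where next_token is None on the last page.
    """
    token = first_token
    while token is not None:
        for attempt in range(max_retries + 1):
            if attempt:
                # Back off a little longer before each retry.
                time.sleep(delay * attempt)
            items, next_token = fetch_page(token)
            if items:
                break  # got a usable page
            # HTTP 200 but no items: try again with the same token.
        else:
            # Assumption: an empty page after a valid token means the data
            # is incomplete, so fail loudly instead of reporting success.
            raise IncompletePageError(
                'no items for continuation token %r after %d attempts'
                % (token, max_retries + 1))
        for item in items:
            yield item
        token = next_token


def make_flaky_fetcher():
    """Fake fetcher for demonstration: page 't2' is empty on its first two calls."""
    pages = {'t1': (['a', 'b'], 't2'), 't2': (['c'], None)}
    calls = {'t2': 0}

    def fetch(token):
        if token == 't2':
            calls['t2'] += 1
            if calls['t2'] < 3:
                return [], None
        return pages[token]

    return fetch


if __name__ == '__main__':
    print(list(iter_entries(make_flaky_fetcher(), 't1')))
```

Raising once the retries are exhausted is a design choice here: it turns the false success into a visible failure. A real fix would also need to decide how long to wait between attempts.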
Issue Analytics
- Created 3 years ago
- Reactions:3
- Comments:7 (6 by maintainers)
Top GitHub Comments
OK. I misunderstood the original problem and thought that only the continuation token was missing. When I tried to test the issue, I got 429’d 😢
This doesn’t seem limited to large playlists.
This channel, as of writing, is broken when downloaded using oldest-first sorting. In this particular case retrying doesn’t help (tested). However, it does show the false success issue: after youtube-dl gets incomplete data for the first continuation page, I’d expect an exception to be raised.