problem with double-dot segments (`/../`) after the hostname

Checklist

  • I’m reporting a bug unrelated to a specific site
  • I’ve verified that I’m running yt-dlp version 2022.04.08 (update instructions) or later (specify commit)
  • I’ve checked that all provided URLs are alive and playable in a browser
  • I’ve checked that all URLs and arguments with special characters are properly quoted or escaped
  • I’ve searched the bugtracker for similar issues including closed ones. DO NOT post duplicates
  • I’ve read the guidelines for opening an issue

Description

Some URLs have a double-dot section after the hostname, which causes problems in yt-dlp.

Example: resolving https://streamwo.com/v/gp445h2f gives this:

$ yt-dlp --get-url https://streamwo.com/v/gp445h2f 
https://reoa92d.com/../uploaded/1649416469.mp4#t=0.1

The result has a ../ segment right after the hostname. Opening it in a browser or downloading it with curl works without problems:

$ curl -O https://reoa92d.com/../uploaded/1649416469.mp4
...
Succeeds

But yt-dlp fails:

$ yt-dlp https://streamwo.com/v/gp445h2f 
[generic] gp445h2f: Requesting header
WARNING: [generic] Falling back on generic information extractor.
[generic] gp445h2f: Downloading webpage
[generic] gp445h2f: Extracting information
[download] Downloading playlist: Streamwo
[generic] playlist Streamwo: Collected 1 videos; downloading 1 of them
[download] Downloading video 1 of 1
[info] gp445h2f: Downloading 1 format(s): 0
ERROR: unable to download video data: HTTP Error 400: Bad Request
[download] Finished downloading playlist: Streamwo

mpv (which uses yt-dlp in its ytdl_hook) fails as well:

$ mpv https://streamwo.com/v/gp445h2f                                
[ffmpeg] https: HTTP error 400 Bad Request
Failed to open https://reoa92d.com/../uploaded/1649416469.mp4#t=0.1.

Exiting... (Errors when loading file)

Verbose log

$ yt-dlp -vU https://streamwo.com/v/gp445h2f 
[debug] Command-line config: ['-vU', 'https://streamwo.com/v/gp445h2f']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, err utf-8, pref UTF-8
[debug] yt-dlp version 2022.04.08 [7884ade65] (zip)
[debug] Python version 3.10.4 (CPython 64bit) - Linux-5.15.32-1-lts-x86_64-with-glibc2.35
[debug] Checking exe version: ffmpeg -bsfs
[debug] Checking exe version: ffprobe -bsfs
[debug] exe versions: ffmpeg 5.0 (setts), ffprobe 5.0, phantomjs 2.1.1, rtmpdump 2.4
[debug] Optional libraries: mutagen, sqlite, websockets
[debug] Proxy map: {}
Latest version: 2022.04.08, Current version: 2022.04.08
yt-dlp is up to date (2022.04.08)
[debug] [generic] Extracting URL: https://streamwo.com/v/gp445h2f
[generic] gp445h2f: Requesting header
WARNING: [generic] Falling back on generic information extractor.
[generic] gp445h2f: Downloading webpage
[generic] gp445h2f: Extracting information
[debug] Looking for video embeds
[debug] Identified a HTML5 media
[debug] Formats sorted by: hasvid, ie_pref, lang, quality, res, fps, hdr:12(7), vcodec:vp9.2(10), acodec, filesize, fs_approx, tbr, vbr, abr, asr, proto, vext, aext, hasaud, source, id
[download] Downloading playlist: Streamwo
[generic] playlist Streamwo: Collected 1 videos; downloading 1 of them
[download] Downloading video 1 of 1
[debug] Default format spec: bestvideo*+bestaudio/best
[info] gp445h2f: Downloading 1 format(s): 0
[debug] Invoking downloader on "https://reoa92d.com/../uploaded/1649416469.mp4#t=0.1"
ERROR: unable to download video data: HTTP Error 400: Bad Request
Traceback (most recent call last):
  File "/home/koonix/./yt-dlp/yt_dlp/YoutubeDL.py", line 3138, in process_info
    success, real_download = self.dl(temp_filename, info_dict)
  File "/home/koonix/./yt-dlp/yt_dlp/YoutubeDL.py", line 2846, in dl
    return fd.download(name, new_info, subtitle)
  File "/home/koonix/./yt-dlp/yt_dlp/downloader/common.py", line 457, in download
    ret = self.real_download(filename, info_dict)
  File "/home/koonix/./yt-dlp/yt_dlp/downloader/http.py", line 369, in real_download
    establish_connection()
  File "/home/koonix/./yt-dlp/yt_dlp/downloader/http.py", line 128, in establish_connection
    ctx.data = self.ydl.urlopen(request)
  File "/home/koonix/./yt-dlp/yt_dlp/YoutubeDL.py", line 3601, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

[download] Finished downloading playlist: Streamwo

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

2 reactions
dirkf commented, Apr 9, 2022

… it’s better to handle this directly in the url_opener

That would be appropriate since urllib/urllib2 is the source of the problem. Whenever I trace the code around opener stuff I get that old-style Adventure feeling: YOU ARE IN A MAZE OF TWISTY LITTLE PASSAGES, ALL ALIKE.

Requests knows how to handle .. components.

0 reactions
coletdjnz commented, Jun 22, 2022

This is what’s in the webpage:

                    <video id="my-video" class="video-js vjs-16-9 vjs-big-play-centered" loop controls playsinline preload="auto" data-setup="{}" > 
                        <source src="https://reoa92d.com/../uploaded/1649416469.mp4#t=0.1" type="video/mp4" /> 
                    </video>

This apparently [1] invalid URL should be corrected to https://reoa92d.com/uploaded/1649416469.mp4#t=0.1, which Mozilla does, but compat_urllib_request.Request() doesn’t. The URL specification says that a .. component should remove the last segment of the path parsed so far, which means doing nothing when there is no path yet, as is the case here.

We could fix such a URL in the core processing of the extracted info_dict (sanitise_url(), eg); alternatively it could be fixed just before opening for download (sanitized_Request(), eg).

urllib.parse.urlparse() doesn’t implement the URL parsing algorithm as specified, even when there is a path component before the ..:

$ python3.9
Python 3.9.7 (default, Sep  4 2021, 18:19:10) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse as urlparse
>>> urlparse.urlparse('http://som.dom.com/path/../no/this/path')
ParseResult(scheme='http', netloc='som.dom.com', path='/path/../no/this/path', params='', query='', fragment='')
>>>

[1] Only apparently because the WhatWG specs essentially make all invalid Web constructs valid for backward compatibility (quirks).

some more examples: https://datatracker.ietf.org/doc/html/rfc3986/#section-5.2.4

urljoin has some basic support for this, but it won’t work in all situations

https://stackoverflow.com/a/40536115
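A short illustration of that limitation, assuming Python 3.5 or later (where urljoin follows the RFC 3986 semantics): dot segments are resolved when the second argument is a relative reference, but a reference that already carries its own host is returned as given. The base URL below is only chosen for the demonstration.

```python
from urllib.parse import urljoin

base = 'https://reoa92d.com/'

# A relative reference is merged with the base, and dot segments are resolved.
resolved = urljoin(base, '/../uploaded/1649416469.mp4#t=0.1')

# A reference that already has a host is passed through untouched,
# so the '..' survives.
unchanged = urljoin(base, 'https://reoa92d.com/../uploaded/1649416469.mp4#t=0.1')

print(resolved)
print(unchanged)
```

So using urljoin as a normaliser would mean splitting the URL first and re-joining only the path part against its own origin.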

Requests knows how to handle .. components.

#3668 will technically resolve this for many

Edit: https://github.com/urllib3/urllib3/blob/314bc8ee91a728f51c2cf04b42353c7b2e12c76b/src/urllib3/util/url.py#L263-L290 is urllib3’s implementation, based on the pseudo-code in the RFC linked above.
