Bug: Url host name parsing result is incorrect when hostname contains unicode character
Bug Report
Description
The parsed host name of a Url is incorrect when the host name contains Unicode characters.
Steps to Reproduce
var url1 = new Url("http://ec².com");
var url2 = new Url("http://a.com/ec²");
Console.WriteLine(url1.Href);
Console.WriteLine(url2.Href);
Expected output:
http://ec².com/
(or maybe http://ec2.com/, which is the host name that would finally be sent for the DNS lookup)
http://a.com/ec%C2%B2
Actual output:
http://ec.com/
http://a.com/ec%C2%B2
Note that ² is missing from the host name in the first line.
Environment details: Windows 10, .NET Framework 4.6.2
Possible Solution
The problem only exists in the host name part.
I saw this comment:
//TODO finish with
//https://url.spec.whatwg.org/#concept-host-parser
I guess host names with Unicode characters are not implemented yet. Do you have any plans for that, or do you need help with it?
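As a rough illustration of the Unicode-to-ASCII host conversion in question, here is a minimal sketch using the .NET System.Globalization.IdnMapping class. It is not AngleSharp code: the IdnSketch class and HostToAscii helper are hypothetical names. How a character such as ² is mapped depends on the platform's IDNA implementation, so no result is claimed for the host in this report.

```csharp
using System;
using System.Globalization;

static class IdnSketch
{
    // Hypothetical helper (not part of AngleSharp): converts a Unicode host
    // name to its ASCII (Punycode) form using the built-in IdnMapping class.
    // GetAscii throws ArgumentException when the name cannot be converted.
    public static string HostToAscii(string host)
    {
        var idn = new IdnMapping();
        return idn.GetAscii(host);
    }
}
```

For example, IdnSketch.HostToAscii("bücher.example") returns xn--bcher-kva.example, since the label containing the non-ASCII letter is Punycode-encoded with the xn-- prefix.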
Issue Analytics
- State:
- Created 4 years ago
- Comments:10 (10 by maintainers)
Top GitHub Comments
Please reopen this issue.
After reading the host-parsing part of the spec, I found that the current implementation still has a problem.
Consider this example input:
The spec considers this host name invalid, but the current Url treats it as valid.
The problem is at https://github.com/AngleSharp/AngleSharp/blob/devel/src/AngleSharp/Url.cs#L1062. The current TrySanatizeHost function validates the host name in 2 passes, while the spec describes a 3-pass process.
For the example above, the result of step 2 may contain a forbidden host code point: ? will be mapped to ? and should then cause a parse failure.
If you don't have any concerns, I would like to implement the correct domain parsing, including the IPv4 and IPv6 parts.
This plan sounds good to me now. I will still spend some time on the URL standard and the IdnMapping class to see if there are any potential issues. Thanks!