Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Bug: Url host name parsing result is incorrect when hostname contains unicode character

See original GitHub issue

Bug Report

Description

Url host name parsing result is incorrect when hostname contains unicode

Steps to Reproduce

var url1 = new Url("http://ec².com");
var url2 = new Url("http://a.com/ec²");

Console.WriteLine(url1.Href);
Console.WriteLine(url2.Href);

Expected output:

http://ec².com/
(or maybe http://ec2.com/ which will be the final host name sent for dns lookup)
http://a.com/ec%C2%B2

Actual output:

http://ec.com/
http://a.com/ec%C2%B2

Note that ² is missing.

Environment details: Windows 10 .Net Framework 4.6.2

Possible Solution

Problem only exist in hostname part

I saw comment:

//TODO finish with
//https://url.spec.whatwg.org/#concept-host-parser

I guess host name with Unicode characters is not implemented yet. Do you have any plan on that or do you need help on that?

Issue Analytics

State:
Created 4 years ago
Comments:10 (10 by maintainers)

Top GitHub Comments

1reaction

hyspacecommented, Jun 6, 2019

Please reopen this issue.

After reading the host-parsing part of spec

I found the current implementation still have a problem.

Consider this example input:

http://www.exampl？e.com/

Spec will consider this host name as invalid but current Url will think it is valid.

The problem is on https://github.com/AngleSharp/AngleSharp/blob/devel/src/AngleSharp/Url.cs#L1062 Current TrySanatizeHost function is doing a 2-passes validation of host name, while in the spec it is a 3 passes process:

3.5.4 string percent decoding
3.5.5 domain to ASCII
3.5.7 check forbidden host code point

For example above, the result of step 2 may contains forbidden host code point. ？ will be mapped to ? and then should cause parse failure.

If you don’t have concern, I would like to implement the correct domain parsing including the ipv4 and ipv6 part.

1reaction

hyspacecommented, May 31, 2019

This plan sounds good to me now. I will still spend some time on the URL standard and IdnMapping Class to see if there are any potential issue.

Thanks!

Top Results From Across the Web

Unicode characters in URLs

All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it...

URL | Node.js v16 API

Invalid host name values assigned to the hostname property are ignored. M url.href. string. Gets and sets the serialized URL.

urllib.parse — Parse URLs into components — Python 3.11.4 ...

The URL parsing functions focus on splitting a URL string into its components, ... into Unicode characters, as accepted by the bytes.decode() method....

Understanding the Fragmented Space of URL Parser ...

URLs with ambiguous hostnames can trick the popular Google Safe ... choosing to escape unicode characters in URLs before parsing them.

2.6 URLs — HTML5

If parsing url resulted in a <host> component, then replace the matching substring of url with the string that results from expanding any...