question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Bug: Url host name parsing result is incorrect when hostname contains unicode character

See original GitHub issue

Bug Report

Description

Url host name parsing result is incorrect when hostname contains unicode

Steps to Reproduce

var url1 = new Url("http://ec².com");
var url2 = new Url("http://a.com/ec²");

Console.WriteLine(url1.Href);
Console.WriteLine(url2.Href);

Expected output:

http://ec².com/
(or maybe http://ec2.com/ which will be the final host name sent for dns lookup)
http://a.com/ec%C2%B2

Actual output:

http://ec.com/
http://a.com/ec%C2%B2

Note that ² is missing.

Environment details: Windows 10 .Net Framework 4.6.2

Possible Solution

Problem only exist in hostname part

I saw comment:

//TODO finish with
//https://url.spec.whatwg.org/#concept-host-parser

I guess host name with Unicode characters is not implemented yet. Do you have any plan on that or do you need help on that?

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Comments:10 (10 by maintainers)

github_iconTop GitHub Comments

1reaction
hyspacecommented, Jun 6, 2019

Please reopen this issue.

After reading the host-parsing part of spec

I found the current implementation still have a problem.

Consider this example input:

http://www.exampl?e.com/

Spec will consider this host name as invalid but current Url will think it is valid.

The problem is on https://github.com/AngleSharp/AngleSharp/blob/devel/src/AngleSharp/Url.cs#L1062 Current TrySanatizeHost function is doing a 2-passes validation of host name, while in the spec it is a 3 passes process:

  1. 3.5.4 string percent decoding
  2. 3.5.5 domain to ASCII
  3. 3.5.7 check forbidden host code point

For example above, the result of step 2 may contains forbidden host code point. will be mapped to ? and then should cause parse failure.

If you don’t have concern, I would like to implement the correct domain parsing including the ipv4 and ipv6 part.

1reaction
hyspacecommented, May 31, 2019

This plan sounds good to me now. I will still spend some time on the URL standard and IdnMapping Class to see if there are any potential issue.

Thanks!

Read more comments on GitHub >

github_iconTop Results From Across the Web

Unicode characters in URLs
All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it...
Read more >
URL | Node.js v16 API
Invalid host name values assigned to the hostname property are ignored. M url.href. string. Gets and sets the serialized URL.
Read more >
urllib.parse — Parse URLs into components — Python 3.11.4 ...
The URL parsing functions focus on splitting a URL string into its components, ... into Unicode characters, as accepted by the bytes.decode() method....
Read more >
Understanding the Fragmented Space of URL Parser ...
URLs with ambiguous hostnames can trick the popular Google Safe ... choosing to escape unicode characters in URLs before parsing them.
Read more >
2.6 URLs — HTML5
If parsing url resulted in a <host> component, then replace the matching substring of url with the string that results from expanding any...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found