Improve WHATWG Url Spec Conformance
New Feature Proposal
Description
This Proposal contains 2 parts:
- Making Url catch up with the latest WHATWG URL Living Standard
- Keeping track of failing [web-platform-tests]
Background
After reading the source code, I found that Url is not aligned with the latest WHATWG URL Living Standard. The differences are small, but they are blocking me from using Url directly in my project.
The main difference is that Url does not return parse failure for some invalid inputs. In the current implementation, this shows up as a wrong value of the IsInvalid property after parsing.
For example, port number validation does not return failure when the port number is larger than 65535.
(new Url("http://example.com:65536")).IsInvalid // expect true, but got false
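As far as I read the URL Standard's port state, the parser accumulates the digits into a number and returns failure as soon as the value exceeds 65535 or a non-digit appears. A minimal sketch of that check follows; PortCheck and TryParsePort are hypothetical names used only for illustration, and this is not AngleSharp's actual code.

```csharp
using System;

// Hypothetical helper, not part of AngleSharp: it mirrors the range check
// performed in the "port state" of the WHATWG URL parser.
public static class PortCheck
{
    // Returns false (parse failure) when the text contains a non-digit or
    // the value exceeds 65535. An empty string means "no port" and is accepted.
    public static bool TryParsePort(string text, out int? port)
    {
        port = null;
        if (text.Length == 0) return true;

        var value = 0;
        foreach (var c in text)
        {
            if (c < '0' || c > '9') return false;
            value = value * 10 + (c - '0');
            // Checking inside the loop keeps long digit runs from overflowing.
            if (value > 65535) return false;
        }

        port = value;
        return true;
    }
}
```

With this sketch, TryParsePort("65536", out _) returns false, while "65535" is accepted, which is the behaviour I would expect IsInvalid to reflect.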
I heard from @FlorianRappl in another thread about this:
I don’t see any error here. Ports are currently 16-bit, however, who knows when (or if) this will change. It should be up to the specific requester to then block a request (let’s say the port is 99999 - what should we do about it? Just drop the port? note that invalid URLs will result to a default URL -> it will have potential negative side-effects) if its doing something invalid. It’s a potential enhancement, but I would regard it as highly optional and I’m fine with omitting it.
I agree it is highly optional for most use cases, but for me it is not. I’m trying to get all links from web pages using AngleSharp.Html and to log all invalid links for security-related research. By “invalid” I mean that the link cannot be opened in a browser. If you try the example in a browser’s address bar, it redirects you to a search engine with the invalid URL as the keyword. If you have such a link in an HTML file like this:
<a href='http://example.con:65536'>link</a>
you will see that it links to an empty page.
I don’t see any negative side-effects of making this example URL return a parse error here, but I do see that I’m getting the wrong result with the current Url, and I see “potential negative side-effects” of not following the standard. Contrary to your statement, I found that current browsers follow the standard pretty well, and even when they fail some test cases, I can find information in issues like this
I understand that it may not be worth implementing some parts of the standard for performance reasons. But then developers need to understand which parts of the standard we are currently not following and why, so they can make decisions more easily. Today, I have to find out which parts we are not following by myself, case by case.
My solution is to catch up with the “latest living standard as much as possible”, to show which parts of the standard we are not following by listing the “failing test cases” in web-platform-tests, and to document why we are not following them.
Some of the changes will be trivial, like the port number validation example, and will not affect performance at all. Some of the changes may be bigger and will have an impact on performance; for those we can do profiling and then decide.
This change may be painful since AngleSharp focuses on HTML parsing, not URLs. But URLs are closely related, and similar libraries in other languages also have their own URL implementation, like jsdom.
@FlorianRappl mentioned “what if port number can have more space in future”. If that happens, it will be reflected in the Standard first. The WHATWG Standard’s principle is to provide backward compatibility (so it won’t directly increase the range; there would surely be some extra marker to declare the new range), so existing implementations should still be fine.
Today, I believe AngleSharp.Url is already the closest implementation of the WHATWG URL Standard, so why not make it even better so that more C# developers can use it? More effort may be needed to maintain the Url class, but I think it is worth it since this is the only option in the C# community today.
I talked to the .NET community about the need for a WHATWG Standard Uri implementation. They will not consider it for the next milestone, but maybe in the future. I would also like to push for it internally within Microsoft, so I hope we can have them maintain the URL implementation in the future.
Specification
I plan to make the following changes:
- Add a test case that takes the WPT URL test JSON file as input and runs all tests defined there.
  - Currently we already have most of them in our tests, but not the invalid ones.
  - This will make it easier to update the test cases in the future.
  - Failing test cases that we are not going to fix will be marked, together with the reason for not fixing them.
- Fix the failing test cases that we want to cover.
  - Work items already known: IPv4 / IPv6, port, Unicode in hash
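To make the data-driven test in the first item more concrete, here is a rough sketch. It assumes the layout I see in WPT's urltestdata.json (an array mixing comment strings with objects that carry input, base, and either href or failure: true). WptUrlRunner, the parse delegate, and the knownFailures map are hypothetical names for illustration; the delegate would wrap whatever URL implementation is under test.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class WptUrlRunner
{
    // parse: hypothetical adapter around the URL implementation under test;
    //        it returns the serialized href, or null on parse failure.
    // knownFailures: inputs we deliberately do not fix, mapped to the reason.
    public static List<string> Run(
        string json,
        Func<string, string?, string?> parse,
        IReadOnlyDictionary<string, string> knownFailures)
    {
        var mismatches = new List<string>();
        using var doc = JsonDocument.Parse(json);

        foreach (var entry in doc.RootElement.EnumerateArray())
        {
            // Plain strings in the array are comments, not test cases.
            if (entry.ValueKind != JsonValueKind.Object) continue;

            var input = entry.GetProperty("input").GetString()!;
            // Documented exceptions are skipped instead of reported.
            if (knownFailures.ContainsKey(input)) continue;

            string? baseUrl = null;
            if (entry.TryGetProperty("base", out var b) && b.ValueKind == JsonValueKind.String)
                baseUrl = b.GetString();

            var expectFailure = entry.TryGetProperty("failure", out var f)
                && f.ValueKind == JsonValueKind.True;
            string? expected = expectFailure ? null : entry.GetProperty("href").GetString();

            var actual = parse(input, baseUrl);
            if (actual != expected)
            {
                var expectedText = expected ?? "failure";
                var actualText = actual ?? "failure";
                mismatches.Add($"{input}: expected {expectedText}, got {actualText}");
            }
        }

        return mismatches;
    }
}
```

Keeping the reasons in the knownFailures map would put the "why we are not fixing it" notes next to the test itself, and updating to a newer snapshot of the test data would only mean replacing the JSON file.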
Top GitHub Comments
Stumbled upon this thread via a reference from wpt: https://url.spec.whatwg.org/commit-snapshots/ exists.
I would like to use the term Living Standard to describe WHATWG. Even if it is not a formal standard, it is the de facto standard today.
We are looking for a solution that follows WHATWG for a similar reason. To serve my customers, who use modern browsers today, I need to pay the cost of catching up with the browsers.
I did not realize this point at the beginning. That’s why I used the bugfix prefix for my PRs. I would like to apologize for it. After working on some changes, looking at the changelog of wpt-tests, and discussing with you, I’m now aware of it. I think it is a good chance to catch up with the latest standard, and I’m happy to contribute.
Being aware of this, I would like to make some changes so that it is easier to catch up with the living standard in the future (the test change I proposed).
I have this need as well, but I did not find any time-frozen URLs for the living standard. Do you have a solution for it?
Thanks!