Improve WHATWG Url Spec Conformance

New Feature Proposal

Description

This proposal has two parts:

  1. Bring Url up to date with the latest WHATWG URL Living Standard
  2. Keep track of failing web-platform-tests

Background

After reading the source code, I found that Url is not aligned with the latest WHATWG URL Living Standard. The differences are small, but they are blocking me from using Url directly in my project.

The main difference is that Url does not return a parse failure for some invalid inputs. In the current implementation, this shows up as a wrong value of the IsInvalid property after parsing.

For example, port validation does not return failure when the port number is larger than 65535:

(new Url("http://example.com:65536")).IsInvalid // expect true, but got false
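
To make the expected behavior concrete, here is a minimal sketch of the port check as I read it in the standard's port state: only ASCII digits are accepted, and a value above 65535 is a parse failure. It is written in Python purely for illustration; `parse_port` and `PortFailure` are hypothetical names and are not part of AngleSharp.

```python
from typing import Optional

MAX_PORT = 2**16 - 1  # 65535, the upper bound used by the URL Standard


class PortFailure(ValueError):
    """Signals that the port turns the whole URL into a parse failure."""


def parse_port(text: str, default_port: Optional[int] = None) -> Optional[int]:
    """Parse the text that follows ':' in the authority (hypothetical helper).

    Returns None for an empty port or for the scheme's default port,
    otherwise the numeric port. Raises PortFailure on invalid input.
    """
    if text == "":
        return None
    # Only ASCII digits are allowed; isdigit() alone would also accept
    # characters such as superscript digits, so isascii() is checked too.
    if not (text.isascii() and text.isdigit()):
        raise PortFailure(f"non-digit in port: {text!r}")
    port = int(text)
    if port > MAX_PORT:
        raise PortFailure(f"port out of range: {port}")
    return None if port == default_port else port
```

Under these assumptions, `parse_port("65536")` raises `PortFailure`, which corresponds to IsInvalid being true for the URL above.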

I heard from @FlorianRappl in another thread about this:

“I don’t see any error here. Ports are currently 16-bit, however, who knows when (or if) this will change. It should be up to the specific requester to then block a request (let’s say the port is 99999 - what should we do about it? Just drop the port? note that invalid URLs will result to a default URL -> it will have potential negative side-effects) if its doing something invalid. It’s a potential enhancement, but I would regard it as highly optional and I’m fine with omitting it.”

I agree it is highly optional for most use cases, but for me it is not. I’m trying to extract all links from web pages using AngleSharp.Html and log all invalid links for security-related research. By “invalid” I mean that the link cannot be opened in a browser. If you type the example into a browser’s address bar, it redirects you to a search engine with the invalid URL as the keyword. If you have such a link in an HTML file like this:

<a href='http://example.com:65536'>link</a>

you will see that it links to an empty page.

I don’t see any negative side effects of making this example URL return a parse error, but I do see that I’m getting a wrong result with the current Url, and I see “potential negative side-effects” in not following the standard. Contrary to your statement, I found that current browsers follow the standard quite well, and even where they fail some test cases, I can find information in issues like this.

I understand that implementing some parts of the standard may not be worth it for performance reasons. But in that case developers need to know which parts of the standard we currently do not follow and why, so they can make decisions more easily. Today, I have to find out where we deviate from the standard by myself, case by case.

My solution is to catch up with the “latest living standard as much as possible”, to show which parts of the standard we do not follow by listing the “failing test cases” in web-platform-tests, and to document why we do not follow them.

Some of the changes will be trivial, like the port number validation example, and will not affect performance at all. Some changes may be bigger and have an impact on performance; for those we can profile and then decide.

This change may be painful since AngleSharp focuses on HTML parsing, not URLs. But URLs are closely related, and similar libraries in other languages, like jsdom, also have their own URL implementation.

@FlorianRappl mentioned “what if port number can have more space in future”. If that happens, it will be reflected in the Standard first. The WHATWG Standard’s principle is to preserve backward compatibility (so it would not simply widen the range; there would surely be some extra marker to declare the new range), so an existing implementation should still be fine.

Today, I believe AngleSharp.Url is already the closest implementation of the WHATWG URL Standard, so why not make it even better so that more C# developers can use it? More effort may be needed to maintain the Url class, but I think it is worth it, since this is the only option in the C# community today.

I talked to the .NET community about the need for a Uri implementation that follows the WHATWG Standard. They will not consider it for the next milestone, but maybe in the future. I would also like to push for it internally within Microsoft, and I hope we can have them maintain the URL implementation in the future.

Specification

I plan to make the following changes:

  1. Add a test case that takes the WPT URL test JSON file as input and runs all tests defined there.
    • We already have most of them in our tests, but not the invalid ones.
    • This will make it easier to update the test cases in the future.
    • Failing test cases that we are not going to fix will be marked, along with the reason for not fixing them.
  2. Fix the failing test cases that we want to cover.
    • Work items already known: IPv4 / IPv6, port, Unicode in hash
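
The data-driven test from step 1 could look roughly like the sketch below. It is in Python for illustration only and assumes the WPT data file keeps its current shape: a JSON array in which string entries are comments and object entries carry `input`, an optional `base`, and either `failure: true` or expected fields such as `href`. The `Parser` signature, `run_wpt_cases`, the `known_failures` map, and the sample entries are all hypothetical and not copied from WPT or AngleSharp.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical parser signature: returns the serialized href, or None on failure.
Parser = Callable[[str, Optional[str]], Optional[str]]


def run_wpt_cases(raw_json: str, parse: Parser,
                  known_failures: Dict[str, str]) -> Dict[str, List[str]]:
    """Run every case and sort the inputs into four buckets.

    known_failures maps an input string to the documented reason for not
    fixing it. Keying by input alone is a simplification: a real runner
    would also need the base to tell cases apart.
    """
    report: Dict[str, List[str]] = {
        "passed": [], "known_failure": [],
        "unexpected_failure": [], "unexpected_pass": [],
    }
    for case in json.loads(raw_json):
        if isinstance(case, str):
            continue  # string entries are comments in the data file
        expected = None if case.get("failure") else case["href"]
        actual = parse(case["input"], case.get("base"))
        ok = actual == expected
        key = case["input"]
        if key in known_failures:
            report["unexpected_pass" if ok else "known_failure"].append(key)
        else:
            report["passed" if ok else "unexpected_failure"].append(key)
    return report


# Illustrative data in the assumed shape, plus a toy parser that never fails.
SAMPLE = json.dumps([
    "comment entries are skipped",
    {"input": "http://example.com:65536", "base": None, "failure": True},
    {"input": "http://example.com:80/", "base": None,
     "href": "http://example.com/"},
])


def lenient_parser(url: str, base: Optional[str]) -> Optional[str]:
    return url  # stands in for a parser that accepts everything unchanged


REPORT = run_wpt_cases(
    SAMPLE, lenient_parser,
    known_failures={"http://example.com:65536": "port range not validated yet"},
)
```

The "unexpected_pass" bucket is there so that an entry in the known-failures list gets flagged once a fix lands, which keeps the documented deviations from going stale.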

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

annevk commented, Feb 19, 2020 (1 reaction)

Stumbled upon this thread via a reference from wpt: https://url.spec.whatwg.org/commit-snapshots/ exists.

hyspace commented, Jun 12, 2019 (1 reaction)

“a word of caution regarding the use of “standard”: WHATWG is not a standard, but rather browser vendors working on what could become a standard.”

I would like to use the term Living Standard to describe WHATWG. Even if it is not an actual standard, it is the de facto standard today.

“The reason AngleSharp follows WHATWG is because AngleSharp is interested to unlock the same potential of the web that is usually just available to web browsers.”

We are looking for a solution that follows WHATWG for a similar reason. To serve my customers, who use modern browsers today, I need to pay the cost of catching up with the browsers.

“when Url was created several years ago the WHATWG URL spec looked quite a bit different in parts; especially the validation. That does not make AngleSharp “wrong”, but rather just following an older version of the released spec.”

I did not realize this at the beginning. That’s why I used the bugfix prefix for my PRs, and I would like to apologize for that. After working on some changes, looking at the changelog of wpt-tests, and discussing with you, I’m now aware of it. I think this is a good chance to catch up with the latest standard, and I’m happy to contribute.

“(whatever will be done now will also be invalid / outdated in weeks / months / years if its not maintained continuously).”

Being aware of this, I would like to make some changes that make it easier to catch up with the living standard in the future (the test change I proposed).

“This is very important to keep in mind - and important for anything that follows (e.g., time-frozen URLs should be used to refer to parts of the spec).”

I have this need as well, but I did not find any time-frozen URLs for the living standard. Do you have a solution for this?

Thanks!