Character Token Position.Position is off by 1 where underlying stream content for said token started with '\r'

Bug Report

Prerequisites

[Y] Can you reproduce the problem in a MWE?
[Y] Are you running the latest version of AngleSharp?
[Y] Did you check the FAQs to see if that helps you?
[Y] Are you reporting to the correct repository? (there are multiple AngleSharp libraries, e.g., AngleSharp.Css for CSS support)
[Y] Did you perform a search in the issues?

For more information, see the CONTRIBUTING guide.

Description

Tokenization of TextSource appears to be “eating”/suppressing carriage return - but only if it is the first character of a Character token.

Steps to Reproduce

Create a memory stream with an initial value/buffer of <html><body><p>\r\nThis is test 1<p> \r\nThis is test 2</body></html>.
Create a TextSource instance passing the memory stream as the first parameter.
Enumerate the tokens returned from calling Tokenize() extension method on that TextSource.

For each token, the token’s Position.Position - 1 (separate bug to be reported here) should be the index into the TextSource of the token’s starting character, and the TextSource’s Index should be the index of the first character after the end of the current token. As such, you should be able to get the “raw text” of the token by taking a substring of the TextSource’s Text property, starting at the start index and with length (end index - start index).

Expected behavior: The “raw text” of the first Character Token should be "\r\nThis is test 1". The “raw text” of the second Character Token should be " \r\nThis is test 2".

Actual behavior: The “raw text” of the first Character Token is actually "\nThis is test 1". The “raw text” of the second Character Token is the expected " \r\nThis is test 2".

Looking at the Character Token’s Data property, it appears that the carriage return is always suppressed (or, if only a carriage return appears with no following linefeed, the carriage return is replaced with a linefeed). As such, it is not initially obvious that the token’s reported starting position is off by one - however, this is a problem if you ever need to inject something before the start of the character token, as you would be inserting into the middle of a CRLF sequence.
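To make the hazard above concrete, here is a minimal sketch in plain C# that does not call AngleSharp at all. It assumes, as described in this report, that the reported start of the first Character token lands one character past the '\r'. The class name `CrlfInsertionDemo`, the helper `InsertAt`, and the marker string are made up for illustration.

```csharp
using System;

public static class CrlfInsertionDemo
{
    // Hypothetical helper: inserts `marker` into `source` at `index`.
    public static string InsertAt(string source, int index, string marker)
        => source.Insert(index, marker);

    public static void Main()
    {
        // Same input as in the steps to reproduce.
        const string html =
            "<html><body><p>\r\nThis is test 1<p> \r\nThis is test 2</body></html>";

        // Index of the first character after the first <p>, i.e. the '\r'
        // where the first Character token is expected to start.
        int expectedStart = html.IndexOf("<p>", StringComparison.Ordinal) + 3;

        // Assumption taken from the report: the reported start is one past the '\r'.
        int reportedStart = expectedStart + 1;

        string good = InsertAt(html, expectedStart, "<!--x-->");
        string bad = InsertAt(html, reportedStart, "<!--x-->");

        // The marker sits in front of the intact CRLF pair.
        Console.WriteLine(good.Contains("<p><!--x-->\r\n"));
        // The marker has been placed between '\r' and '\n'.
        Console.WriteLine(bad.Contains("\r<!--x-->\n"));
    }
}
```

With the off-by-one start index, the inserted text ends up between the '\r' and the '\n', which is the "inserting into the middle of a CRLF sequence" case described above.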

Environment details: Windows, Visual Studio 2019, .NET Framework 4.6, Any CPU (pref 64-bit)

Possible Solution

I took a very quick peek into the code, and it appears that this might have something to do with AngleSharp.Common.BaseTokenizer.NormalizeForward, which seems to skip over the carriage return character. I did not, however, attempt to trace it any further.

Issue Analytics

State:
Created 4 years ago
Comments:8 (5 by maintainers)

Top GitHub Comments

cmwoods commented, May 1, 2019

The key lies in two things:

Your testing is looking at the HtmlToken.Data whereas I’m looking at the “raw text” from the TextSource corresponding to the token:

...
var token = t.Get();
int tokenStart = token.Position.Position - 1;
string rawText = s.Text.Substring(tokenStart, s.Index - tokenStart);
...

I should have expressed the issue better - the crux of the issue lies buried in the following paragraph:

Looking at the Character Token’s Data property it appears that the carriage return is always suppressed (or if only carriage return appears and no following linefeed then the carriage return is replaced with linefeed). As such, it is not initially obvious that the token’s reported starting position is off by one - however this is a problem if you ever need to inject something before the start of the character token as you would be inserting into the middle of a CRLF sequence.

I should have stated more clearly that the problem is the incorrect reported token start position, and that this is easy to see by looking at the “raw text” derived from the token’s returned start position. The other way to notice it is, of course, to explicitly count out the expected Position.Position value for each and every token and then compare actual vs. expected. For me at least, checking the “raw text” is easier, as that test can be generalized to any HTML input stream.

FlorianRappl commented, Apr 30, 2019

Actually both CRs should be suppressed. I will look into this.

By W3C spec CRLF needs to be normalized to LF.
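As a rough illustration of the normalization rule mentioned here, the sketch below replaces every CRLF pair and every lone CR with a single LF. It is not AngleSharp's implementation; the class name `NewlineNormalization` and the method `Normalize` are hypothetical and use only the .NET base library.

```csharp
using System;
using System.Text;

public static class NewlineNormalization
{
    // Replaces every CRLF pair and every lone CR with a single LF.
    public static string Normalize(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '\r')
            {
                // Skip the LF of a CRLF pair so the pair yields exactly one LF.
                if (i + 1 < input.Length && input[i + 1] == '\n')
                {
                    i++;
                }
                sb.Append('\n');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string raw = "\r\nThis is test 1";
        string data = Normalize(raw);
        // The normalized text is one character shorter than the raw text,
        // so offsets into the two strings no longer line up after a CRLF.
        Console.WriteLine(raw.Length - data.Length);
    }
}
```

Because each CRLF pair shrinks by one character, a tokenizer that normalizes this way has to track source offsets separately from offsets into the normalized data, which is where an off-by-one like the one reported can creep in.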