Character Token Position.Position is off by 1 where underlying stream content for said token started with '\r'
Bug Report
Prerequisites
- [Y] Can you reproduce the problem in a MWE?
- [Y] Are you running the latest version of AngleSharp?
- [Y] Did you check the FAQs to see if that helps you?
- [Y] Are you reporting to the correct repository? (there are multiple AngleSharp libraries, e.g., AngleSharp.Css for CSS support)
- [Y] Did you perform a search in the issues?
For more information, see the CONTRIBUTING guide.
Description
Tokenization of a TextSource appears to “eat” (suppress) a carriage return, but only when it is the first character of a Character token.
Steps to Reproduce
- Create a memory stream with an initial value/buffer of <html><body><p>\r\nThis is test 1<p> \r\nThis is test 2</body></html>.
- Create a TextSource instance passing the memory stream as the first parameter.
- Enumerate the tokens returned from calling the Tokenize() extension method on that TextSource.
For each token, the token’s Position.Position-1 (separate bug to be reported here) should be the index into the TextSource of the token’s starting character, and the TextSource’s Index should be the index of the first character after the end of the current token. As such, you should be able to get the “raw text” of the token by taking a substring of TextSource’s Text() property, using the start index and a computed length (end index - start index).
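The slicing described above can be sketched without AngleSharp at all. This is a minimal Python illustration, not the library's API: raw_text is a hypothetical helper, and the offsets are worked out by searching the sample input rather than taken from a tokenizer run.

```python
# Hypothetical helper, not part of AngleSharp: slice the raw source text
# for a token given its start index and the index just past its end.
def raw_text(source, start, end):
    return source[start:end]

SOURCE = "<html><body><p>\r\nThis is test 1<p> \r\nThis is test 2</body></html>"

# Offsets derived from SOURCE itself (an assumption about where the
# Character tokens begin and end, based on the report).
first_start = SOURCE.index("\r\nThis is test 1")   # index of the leading CR
first_end = SOURCE.index("<p>", first_start)       # start of the second <p>
second_start = first_end + len("<p>")
second_end = SOURCE.index("</body>")

expected_first = raw_text(SOURCE, first_start, first_end)
expected_second = raw_text(SOURCE, second_start, second_end)

# A start index that is one too high lands on the LF and splits the CRLF pair.
shifted_first = raw_text(SOURCE, first_start + 1, first_end)
```

With the sample input, a start index that is one too high produces a slice beginning at the linefeed, which is the symptom reported here.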
Expected behavior:
The “raw text” of the first Character Token should be "\r\nThis is test 1".
The “raw text” of the second Character Token should be " \r\nThis is test 2".
Actual behavior:
The “raw text” of the first Character Token is actually "\nThis is test 1".
The “raw text” of the second Character Token is the expected " \r\nThis is test 2".
Looking at the Character Token’s Data property, it appears that the carriage return is always suppressed (or, if only a carriage return appears with no following linefeed, the carriage return is replaced with a linefeed). As such, it is not initially obvious that the token’s reported starting position is off by one. However, this is a problem if you ever need to inject something before the start of the character token, as you would be inserting into the middle of a CRLF sequence.
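To illustrate how newline normalization and source positions can coexist, here is a minimal Python sketch. It is not AngleSharp's implementation: normalize_newlines is a hypothetical function that mimics the behavior described above (CRLF becomes LF, a lone CR becomes LF) while recording, for each output character, the index it came from in the raw input.

```python
def normalize_newlines(source):
    """Return the normalized text and, per output character, its source index."""
    out = []
    offsets = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\r":
            out.append("\n")
            offsets.append(i)  # map the emitted LF back to the CR that started it
            # Skip the LF of a CRLF pair so it is not emitted twice.
            i += 2 if source[i + 1 : i + 2] == "\n" else 1
        else:
            out.append(ch)
            offsets.append(i)
            i += 1
    return "".join(out), offsets

data, offsets = normalize_newlines("\r\nThis is test 1")
lone, lone_offsets = normalize_newlines("a\rb")
```

Under this scheme the first normalized character maps back to the carriage return, so a token start taken from offsets[0] would not fall inside the CRLF pair.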
Environment details: Windows, Visual Studio 2019, .NET Framework 4.6, Any CPU (pref 64-bit)
Possible Solution
I took a very quick peek into the code, and it appears that this might have something to do with AngleSharp.Common.BaseTokenizer.NormalizeForward, which seems to skip over the carriage return character. I did not, however, attempt to trace through it.
Top GitHub Comments
The key lies in two things:
I should have stated more clearly that the problem is the reported token start position being incorrect, and that it is easy to see this has happened by looking at the “raw text” based on the token’s returned start position. The other way to notice this is, of course, to explicitly count out the expected Position.Position value for each and every token and then compare actual against expected. However, I found that, for me at least, it is easier to check the “raw text”, as that test can be generalized to any HTML input stream.
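One way such a generalized check might look is sketched below in Python. It assumes, which the report does not state, that every character of the input belongs to exactly one token, so consecutive (start, end) spans should meet with no gaps. find_gaps and the span lists are hypothetical; the spans are hand-derived for the sample input, not tokenizer output.

```python
def find_gaps(source_length, spans):
    """Return (previous_end, start) pairs where consecutive spans do not meet."""
    gaps = []
    previous_end = 0
    for start, end in spans:
        if start != previous_end:
            gaps.append((previous_end, start))
        previous_end = end
    if previous_end != source_length:
        gaps.append((previous_end, source_length))
    return gaps

# Hand-derived spans for the 65-character sample input from this issue.
good = [(0, 6), (6, 12), (12, 15), (15, 31), (31, 34), (34, 51), (51, 58), (58, 65)]
# Same spans, with the first Character token starting one position too late.
shifted = [(16, 31) if span == (15, 31) else span for span in good]

good_gaps = find_gaps(65, good)
shifted_gaps = find_gaps(65, shifted)
```

A start position that skips the carriage return shows up as a one-character gap at the CR's index, without needing expected positions for each token.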
Actually both CRs should be suppressed. I will look into this.
Per the W3C spec, CRLF needs to be normalized to LF.