Character Token Position.Position is off by 1 where underlying stream content for said token started with '\r'

Bug Report

Prerequisites

  • [Y] Can you reproduce the problem in a MWE?
  • [Y] Are you running the latest version of AngleSharp?
  • [Y] Did you check the FAQs to see if that helps you?
  • [Y] Are you reporting to the correct repository? (there are multiple AngleSharp libraries, e.g., AngleSharp.Css for CSS support)
  • [Y] Did you perform a search in the issues?

For more information, see the CONTRIBUTING guide.

Description

Tokenization of a TextSource appears to “eat” (suppress) a carriage return, but only when it is the first character of a Character token.

Steps to Reproduce

  1. Create a memory stream with an initial value/buffer of <html><body><p>\r\nThis is test 1<p> \r\nThis is test 2</body></html>.
  2. Create a TextSource instance passing the memory stream as the first parameter.
  3. Enumerate the tokens returned from calling Tokenize() extension method on that TextSource.

For each token, the token’s Position.Position - 1 (a separate bug to be reported here) should be the index into the TextSource of the token’s starting character, and the TextSource’s Index should be the index of the first character after the end of the current token. As such, you should be able to get the “raw text” of the token by taking a substring of the TextSource’s Text property, starting at the start index, with length (end index - start index).

Expected behavior: The “raw text” of the first Character Token should be "\r\nThis is test 1". The “raw text” of the second Character Token should be " \r\nThis is test 2".

Actual behavior: The “raw text” of the first Character Token is actually "\nThis is test 1". The “raw text” of the second Character Token is the expected " \r\nThis is test 2".

Looking at the Character Token’s Data property, it appears that the carriage return is always suppressed (or, if only a carriage return appears with no following linefeed, the carriage return is replaced with a linefeed). As such, it is not initially obvious that the token’s reported starting position is off by one - however, this is a problem if you ever need to inject something before the start of the character token, as you would be inserting into the middle of a CRLF sequence.
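To make the off-by-one concrete, here is a minimal sketch that does not call AngleSharp at all. It uses the same input string as the steps above and compares the slice taken from the expected start (the '\r') with the slice taken from a start that is one character too far, which is what the report describes. The names RawTextDemo, RawText and Escape are hypothetical helpers introduced only for this illustration, and the "[X]" marker stands in for whatever would be injected.

```csharp
using System;

static class RawTextDemo
{
    // Same input as in the steps to reproduce.
    public const string Html =
        "<html><body><p>\r\nThis is test 1<p> \r\nThis is test 2</body></html>";

    // Returns the characters in [start, end) of the source text.
    public static string RawText(string source, int start, int end)
    {
        return source.Substring(start, end - start);
    }

    // Makes CR and LF visible when printing.
    public static string Escape(string s)
    {
        return s.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    static void Main()
    {
        // Expected start: the '\r' that opens the first character token.
        int expectedStart = Html.IndexOf('\r');
        // Assumption for this sketch: the reported start is one too far.
        int reportedStart = expectedStart + 1;
        // The token ends where the next tag begins.
        int end = Html.IndexOf("<p>", expectedStart, StringComparison.Ordinal);

        Console.WriteLine(Escape(RawText(Html, expectedStart, end))); // \r\nThis is test 1
        Console.WriteLine(Escape(RawText(Html, reportedStart, end))); // \nThis is test 1

        // Injecting at the reported start splits the CRLF pair.
        string injected = Html.Insert(reportedStart, "[X]");
        Console.WriteLine(injected.Contains("\r[X]\n")); // True
    }
}
```

The last line is the practical consequence: an insertion at the reported position lands between the CR and the LF rather than in front of the token.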

Environment details: Windows, Visual Studio 2019, .NET Framework 4.6, Any CPU (pref 64-bit)

Possible Solution

I took a very quick peek into the code and it appears that this might have something to do with AngleSharp.Common.BaseTokenizer.NormalizeForward, which seems to skip over the carriage return character. I did not, however, attempt to trace this any further.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

cmwoods commented, May 1, 2019

The key lies in two things:

  1. Your testing is looking at the HtmlToken.Data whereas I’m looking at the “raw text” from the TextSource corresponding to the token:
...
var token = t.Get();
int tokenStart = token.Position.Position - 1;
string rawText = s.Text.Substring(tokenStart, s.Index - tokenStart);
...
  2. I should have expressed the issue better - the crux of the issue lies buried in the following paragraph:

Looking at the Character Token’s Data property it appears that the carriage return is always suppressed (or if only carriage return appears and no following linefeed then the carriage return is replaced with linefeed). As such, it is not initially obvious that the token’s reported starting position is off by one - however this is a problem if you ever need to inject something before the start of the character token as you would be inserting into the middle of a CRLF sequence.

I should have stated more clearly that the problem is the reported token start position being incorrect, and then said that this is easy to see by looking at the “raw text” based on the token’s returned start position. The other way to notice this is, of course, to explicitly count out the expected Position.Position values for each and every token and then compare actual vs. expected. However, I found that, for me at least, it is easier to check the “raw text”, as that test can be generalized against any HTML input stream.

FlorianRappl commented, Apr 30, 2019

Actually both CRs should be suppressed. I will look into this.

By W3C spec CRLF needs to be normalized to LF.
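As an illustration of how newline normalization and accurate source positions can coexist, the sketch below normalizes CRLF and lone CR to LF while recording, for each output character, the index of the input character that produced it. This is not AngleSharp's implementation; NewlineNormalizer and Normalize are hypothetical names, and the choice to map a normalized LF back to the CR of its CRLF pair is an assumption made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class NewlineNormalizer
{
    // Returns the normalized text. offsets[i] is the index in the input
    // of the character that produced normalized[i].
    public static string Normalize(string input, out int[] offsets)
    {
        var text = new StringBuilder(input.Length);
        var map = new List<int>(input.Length);

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            if (c == '\r')
            {
                // CRLF and lone CR both become a single LF,
                // attributed to the position of the CR.
                text.Append('\n');
                map.Add(i);

                if (i + 1 < input.Length && input[i + 1] == '\n')
                {
                    i++; // consume the LF of a CRLF pair
                }
            }
            else
            {
                text.Append(c);
                map.Add(i);
            }
        }

        offsets = map.ToArray();
        return text.ToString();
    }

    static void Main()
    {
        int[] offsets;
        string normalized = Normalize("<p>\r\nThis", out offsets);

        // The LF at normalized index 3 maps back to the CR at input index 3,
        // so a token starting there would report the start of the CRLF pair.
        Console.WriteLine(normalized.Length); // 8
        Console.WriteLine(offsets[3]);        // 3
        Console.WriteLine(offsets[4]);        // 5
    }
}
```

With a mapping like this, the token data can still contain only the LF while the reported start position points at the CR in the original stream.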
