Html Parser Strips CRLF and replaces with LF
This seems to be by design, but I’m wondering about the intention here. I reviewed #149 and see why this should apply to XML, but perhaps it shouldn’t apply to HTML.
I was working on a project that uses Premailer.net (which in turn uses AngleSharp) to inline some CSS for emails. Every so often, we would receive errors regarding Maximum Line Length (see RFC 2822 2.1.1 https://www.ietf.org/rfc/rfc2822.txt) for some of these emails (a few email servers were rejecting the emails we were sending out) and I ended up investigating this in depth.
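To make the limit concrete: RFC 2822 section 2.1.1 requires each line to be no more than 998 characters, excluding the CRLF. The following is a minimal Python sketch (not part of Premailer.net or AngleSharp; `overlong_lines` is a hypothetical helper name) that assumes a strict receiver treats only CRLF as a line break. Under that assumption, a body whose line endings were rewritten to bare LF can count as a single very long line.

```python
MAX_LINE_LENGTH = 998  # RFC 2822 section 2.1.1 hard limit, excluding the CRLF


def overlong_lines(message: str, limit: int = MAX_LINE_LENGTH):
    """Return (line_number, length) for each CRLF-delimited line over the limit.

    Assumption: only CRLF counts as a line break, as a strict receiver
    might interpret the RFC. Lengths are counted in Python characters,
    which matches octets only for ASCII content.
    """
    return [
        (number, len(line))
        for number, line in enumerate(message.split("\r\n"), start=1)
        if len(line) > limit
    ]


# Three 600-character lines: fine with CRLF, one 1802-character line with LF.
with_crlf = "\r\n".join(["x" * 600] * 3)
with_lf = with_crlf.replace("\r\n", "\n")
print(overlong_lines(with_crlf))
print(overlong_lines(with_lf))
```

Running a check like this on the generated HTML before handing it to the mail library would at least flag the messages at risk.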
Using Premailer, we’d generate the HTML as a string and use a different third-party library to then send it out to interested parties. As input to Premailer, we would supply HTML with CRLF line endings.
Example: <html><head>etc</head>\r\n<body>yaddayadda</body></html>\r\n
After being processed by Premailer, all newlines would be replaced with \n.
Yet this presents a problem, as technically CRLF is the end-of-line marker per RFC 2616 (https://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html). It seems like most mail servers are lax in enforcing this rule, while others follow it more strictly.
After investigating, we found the cause to be a result of calling NormalizeForward in BaseTokenizer in AngleSharp, which normalizes all forms of newline to LF.
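The effect described here can be illustrated with a short Python sketch. This is not AngleSharp's code; `normalize_newlines` is a hypothetical stand-in that only mimics the observable result, where CRLF and lone CR both become LF.

```python
import re


def normalize_newlines(text: str) -> str:
    """Collapse CRLF and lone CR into LF, leaving existing LF untouched."""
    # "\r\n?" matches a CR optionally followed by LF, so a CRLF pair
    # is replaced by one LF rather than two.
    return re.sub(r"\r\n?", "\n", text)


html = "<html><head>etc</head>\r\n<body>yaddayadda</body></html>\r\n"
print(repr(normalize_newlines(html)))
```

After this step the original line endings are no longer recoverable from the parsed document, which is why the serialized output contains only \n.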
While I’m not 100% confident in my analysis, I figured reaching out wouldn’t hurt. It seems like one of the email clients we are using will replace LF with CRLF anyway, but for another, we have had sporadic delivery issues.
Issue Analytics
- Created 5 years ago
- Comments:7 (3 by maintainers)
Top GitHub Comments
Maybe I am misunderstanding something fundamental, however. AngleSharp is returning HTML, as it should. Protocols like HTTP/SMTP are perhaps not concerns of this service. Perhaps, then, writing a new formatter would be best.
Again, thanks for your consideration.
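One way to picture the post-processing idea from the comment above is a step that restores CRLF on the serialized HTML before it is passed to the mail library. The sketch below is in Python for brevity and is only an assumption about what such a step could look like; `to_crlf` is a hypothetical name, not an AngleSharp or Premailer.net API.

```python
import re


def to_crlf(html: str) -> str:
    """Rewrite every newline form (CRLF, lone CR, lone LF) as CRLF.

    The alternation tries "\r\n" first, so existing CRLF pairs are
    matched as a unit and the function is safe to apply more than once.
    """
    return re.sub(r"\r\n|\r|\n", "\r\n", html)


serialized = "<html><head>etc</head>\n<body>yaddayadda</body></html>\n"
print(repr(to_crlf(serialized)))
```

This restores the line endings but does not shorten any line, so content with a single line longer than 998 characters would still need wrapping or a transfer encoding.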
Cool thanks for following up on this! Never knew about the 998 char limit - thanks!