Non-latin characters get HTML-encoded with decodeEntities=true
Hi,
Consider this code:
var cheerio = require('cheerio')
var ch = cheerio.load('<div>абв</div>', { decodeEntities: true })
console.log(ch('div').html())
It prints &#x430;&#x431;&#x432;. If I set decodeEntities to false, the output is the expected абв.
Two issues here:
- This is not how decodeEntities is supposed to work. I tested with htmlparser2 directly, and it works as expected both ways (the code is below).
- Htmlparser2 recommends always setting decodeEntities to true for security reasons.
Test code for htmlparser:
var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    ontext: function (text) {
        console.log("-->", text);
    }
}, { decodeEntities: true });
parser.write("<div>абв</div>");
parser.end();
Output: --> абв (as expected)
P.S. Cheerio version: 0.20.0
Issue Analytics
- State:
- Created 7 years ago
- Reactions:13
- Comments:27 (5 by maintainers)
Top GitHub Comments
An obvious solution is to undo all the escaping work that dom-serializer does:
It modifies the cheerio prototype (which might not be desirable in your case), and it could slow things down if you're calling html() a lot. So I'd be happy to know if there's a better solution out there.
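The workaround described above could be sketched roughly as follows. This is a minimal illustration, not the commenter's original code: the helper names decodeNonAsciiEntities and patchHtml are hypothetical, and the sketch assumes that html() is defined on cheerio.prototype and that the serializer emits numeric character references such as &#x430; for non-ASCII characters.

```javascript
// Hypothetical helper (not part of cheerio or dom-serializer): converts
// numeric character references for non-ASCII code points back into
// literal characters, leaving everything else untouched.
function decodeNonAsciiEntities(html) {
  return html.replace(/&#(x[0-9a-f]{1,6}|[0-9]{1,7});/gi, function (entity, body) {
    var isHex = body[0] === 'x' || body[0] === 'X';
    var code = parseInt(isHex ? body.slice(1) : body, isHex ? 16 : 10);
    // Keep ASCII references (e.g. &#x3c; for "<") escaped so the markup
    // stays safe, and skip values outside the Unicode range.
    if (code < 0x80 || code > 0x10ffff) return entity;
    return String.fromCodePoint(code);
  });
}

// Assumed wiring: wraps html() on the given constructor's prototype.
// Only string results (the getter form) are post-processed; the setter
// form, which returns the selection, is passed through unchanged.
function patchHtml(cheerio) {
  var originalHtml = cheerio.prototype.html;
  cheerio.prototype.html = function () {
    var result = originalHtml.apply(this, arguments);
    return typeof result === 'string' ? decodeNonAsciiEntities(result) : result;
  };
}
```

With this in place, calling patchHtml(require('cheerio')) once at startup would make later html() calls return literal non-ASCII characters. The extra regex pass runs on every html() call, which is the slowdown mentioned above.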
This was reported in #565 and for some reason closed. I agree, .html() should not escape non-ASCII characters to entities, but should return a Unicode string, like .text() does.