Non-latin characters get HTML-encoded with decodeEntities=true

Hi,

Consider this code:

var cheerio = require('cheerio')
var ch = cheerio.load('<div>абв</div>', { decodeEntities: true })
console.log(ch('div').html())

It prints &#x430;&#x431;&#x432;. If I set decodeEntities to false, the output will be the expected абв.

Two issues here:

  1. This is not how decodeEntities is supposed to work. I tested with htmlparser2 directly, and it works as expected both ways (the code is below).
  2. htmlparser2 recommends always setting decodeEntities to true for security reasons.

Test code for htmlparser2:


var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    ontext: function(text){
        console.log("-->", text);
    }
}, {decodeEntities: true});
parser.write("<div>абв</div>");
parser.end();

Output: --> абв (as expected)

P.S. Cheerio version: 0.20.0

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 13
  • Comments: 27 (5 by maintainers)

Top GitHub Comments

rlidwka commented, Jan 27, 2017 (14 reactions)

Did you find a way around this?

An obvious workaround is to undo all the escaping work that dom-serializer does:

var cheerio = require('cheerio')

var cheerio_html = cheerio.prototype.html

cheerio.prototype.html = function wrapped_html() {
  var result = cheerio_html.apply(this, arguments)
 
  if (typeof result === 'string') {
    result = result.replace(/&#x([0-9a-f]{1,6});/ig, function (entity, code) {
      code = parseInt(code, 16)

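      // don't unescape ascii characters, assuming that all ascii characters
      // are encoded for a good reason; also skip values above U+10FFFF,
      // which would make String.fromCodePoint throw a RangeError
      if (code < 0x80 || code > 0x10FFFF) return entity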

      return String.fromCodePoint(code)
    })
  }

  return result
}

console.log(cheerio.load('<div>абв"&quot;&lt;&gt;</div>').root().html())
// <div>абв&quot;&quot;&lt;&gt;</div>

This modifies the cheerio prototype (which might not be desirable in your case), and the extra pass over the output could add overhead if you call html() a lot. So I'd be happy to know if there's a better solution out there.
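
If patching the prototype is a concern, the same unescaping can be applied as a plain function to the string that html() returns. The following is only a sketch of that idea: `unescapeNonAscii` is a hypothetical helper name, not part of cheerio or dom-serializer, and it assumes the serializer emits hexadecimal references of the form `&#x...;` as shown in the output above.

```javascript
// Hypothetical helper (not a cheerio API): decodes hexadecimal numeric
// character references, but only for non-ASCII code points.
function unescapeNonAscii(html) {
  return html.replace(/&#x([0-9a-f]{1,6});/ig, function (entity, hex) {
    var code = parseInt(hex, 16)

    // Leave ASCII references alone, assuming they were escaped on purpose.
    if (code < 0x80) return entity

    // Values above U+10FFFF are not valid code points; keep them as-is
    // instead of letting String.fromCodePoint throw a RangeError.
    if (code > 0x10FFFF) return entity

    return String.fromCodePoint(code)
  })
}

// The input mimics what html() printed in the original report.
var serialized = '<div>&#x430;&#x431;&#x432;&quot;&lt;&gt;</div>'
console.log(unescapeNonAscii(serialized))
// <div>абв&quot;&lt;&gt;</div>
```

With cheerio this would be called as unescapeNonAscii(ch('div').html()). Named entities such as &quot; and &lt; are not touched, because the pattern only matches hexadecimal references.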

shelldweller commented, Jun 1, 2016 (8 reactions)

This was reported in #565 and closed for some reason. I agree: .html() should not escape non-ASCII characters to entities; it should return a Unicode string, as .text() does.
