sphinxsearch and astral characters

Workaround: solved using code from here:

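var mysql = require('mysql2')

var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10,
  // Decode STRING columns from their raw bytes as UTF-8 instead of
  // relying on the charset the server reports for the column.
  typeCast: function (field, next) {
    if (field.type === 'STRING') {
      return field.buffer().toString('utf-8');
    }
    // Every other column type keeps the default conversion.
    return next();
  }
})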

I’m using mysql2 to connect to sphinx (a search engine that works over the MySQL 4.1 protocol, although its SQL syntax differs quite a bit). So far I have been unable to reproduce the following issue with a standard MySQL server.

When I send text there and get it back, each astral character (U+10000 and up, represented as a surrogate pair) gets replaced with four U+FFFD.
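To make the terms concrete, here is a small sketch using only Node's built-in string and Buffer APIs. It shows that the emoji from the test query is one code point, two UTF-16 code units in a JavaScript string, and four bytes in UTF-8.

```javascript
// U+1F639 is the emoji used in the test query of this report.
var cat = '\u{1F639}'

// One code point, but two UTF-16 code units (a surrogate pair).
console.log(cat.length)                       // 2
console.log(cat.codePointAt(0).toString(16))  // 1f639
console.log(cat.charCodeAt(0).toString(16))   // d83d (high surrogate)
console.log(cat.charCodeAt(1).toString(16))   // de39 (low surrogate)

// In UTF-8 the same character takes four bytes.
console.log(Buffer.from(cat, 'utf8'))         // <Buffer f0 9f 98 b9>
```

The four bytes f0 9f 98 b9 are the same ones that show up in the network capture in this report, both in the query and in the response.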

I assume this is a bug in node-mysql2 because node-mysql works correctly in this exact case.

Source code:

//var mysql = require('mysql')
var mysql = require('mysql2')

var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10
})

pool.getConnection(function (err, connection) {
  if (err) throw err

  connection.query(`CALL SNIPPETS(('test 😹 αβγ'), 'forum_posts', 'whatever')`,
    function (err, response) {
      if (err) throw err

      console.log(response)
    }
  )
})

Output with mysql module:

[ RowDataPacket { snippet: 'test 😹 αβγ' } ]

Output with mysql2 module:

[ TextRow { snippet: 'test ���� αβγ' } ]
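One way to get exactly four replacement characters is a decoder that accepts only 1- to 3-byte UTF-8 sequences and treats every other byte as invalid. The sketch below is a hypothetical decoder written for illustration; it is not mysql2's code, and whether mysql2 takes this exact path is an assumption. The input bytes are the snippet value as it appears in the server response in this report.

```javascript
// Hypothetical BMP-only decoder, for illustration only (not mysql2's code).
// It accepts 1-, 2- and 3-byte UTF-8 sequences and emits U+FFFD for every
// other byte. It does not check for overlong forms or encoded surrogates.
function isContinuation(b) {
  return (b & 0xc0) === 0x80
}

function decodeBmpOnly(buf) {
  var out = ''
  var i = 0
  while (i < buf.length) {
    var b = buf[i]
    if (b < 0x80) {
      out += String.fromCharCode(b)
      i += 1
    } else if (b >= 0xc2 && b <= 0xdf && isContinuation(buf[i + 1])) {
      out += String.fromCharCode(((b & 0x1f) << 6) | (buf[i + 1] & 0x3f))
      i += 2
    } else if (b >= 0xe0 && b <= 0xef &&
               isContinuation(buf[i + 1]) && isContinuation(buf[i + 2])) {
      out += String.fromCharCode(
        ((b & 0x0f) << 12) | ((buf[i + 1] & 0x3f) << 6) | (buf[i + 2] & 0x3f))
      i += 3
    } else {
      // Lead byte 0xf0 and its three continuation bytes all end up here.
      out += '\ufffd'
      i += 1
    }
  }
  return out
}

// Bytes of the snippet value taken from the server response.
var raw = Buffer.from('7465737420f09f98b920ceb1ceb2ceb3', 'hex')

console.log(decodeBmpOnly(raw))    // test ���� αβγ
console.log(raw.toString('utf8'))  // test 😹 αβγ
```

The 2-byte Greek letters survive in both cases, which matches the observed output: only the 4-byte character is damaged.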

Here’s network traffic:

    00000000  43 00 00 00 0a 32 2e 33  2e 32 2d 69 64 36 34 2d C....2.3 .2-id64-
    00000010  62 65 74 61 20 28 3f 3f  3f 29 00 01 00 00 00 01 beta (?? ?)......
    00000020  02 03 04 05 06 07 08 00  08 82 21 02 00 00 00 00 ........ ..!.....
    00000030  00 00 00 00 00 00 00 00  00 00 01 02 03 04 05 06 ........ ........
    00000040  07 08 09 0a 0b 0c 0d                             .......
00000000  23 00 00 01 cf f3 82 00  00 00 00 00 e0 00 00 00 #....... ........
00000010  00 00 e0 01 00 00 00 00  90 37 25 03 00 00 00 00 ........ .7%.....
00000020  98 36 25 03 00 00 00                             .6%....
    00000047  07 00 00 02 00 00 00 00  00 00 00                ........ ...
00000027  3f 00 00 00 03 43 41 4c  4c 20 53 4e 49 50 50 45 ?....CAL L SNIPPE
00000037  54 53 28 28 27 74 65 73  74 20 f0 9f 98 b9 20 ce TS(('tes t .... .
00000047  b1 ce b2 ce b3 27 29 2c  20 27 66 6f 72 75 6d 5f .....'),  'forum_
00000057  70 6f 73 74 73 27 2c 20  27 77 68 61 74 65 76 65 posts',  'whateve
00000067  72 27 29                                         r')
    00000052  01 00 00 01 01 24 00 00  02 03 64 65 66 00 00 00 .....$.. ..def...
    00000062  07 73 6e 69 70 70 65 74  07 73 6e 69 70 70 65 74 .snippet .snippet
    00000072  0c 21 00 ff 00 00 00 fe  00 00 00 00 00 05 00 00 .!...... ........
    00000082  03 fe 00 00 00 00 11 00  00 04 10 74 65 73 74 20 ........ ...test 
    00000092  f0 9f 98 b9 20 ce b1 ce  b2 ce b3 05 00 00 05 fe .... ... ........
    000000A2  00 00 00 00                                      ....
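The capture also shows which character set the server declares for the column. The following sketch parses the column-definition packet from the dump, assuming the standard MySQL 4.1 column-definition layout; the helper names are made up for this example. It reads character set id 33, which is utf8_general_ci in MySQL's collation table, even though the row data contains a 4-byte UTF-8 sequence.

```javascript
// Payload of the column-definition packet from the capture
// (the 4-byte packet header `24 00 00 02` is already stripped).
var payload = Buffer.from(
  '03646566' +            // catalog: "def"
  '00' + '00' + '00' +    // schema, table, org_table: empty strings
  '07736e6970706574' +    // name: "snippet"
  '07736e6970706574' +    // org_name: "snippet"
  '0c' +                  // length of the fixed-size fields that follow
  '2100' +                // character set, little-endian
  'ff000000' +            // column length
  'fe' +                  // column type
  '0000' + '00' + '0000', // flags, decimals, filler
  'hex'
)

// Hypothetical helper: read a length-prefixed string.
// Assumes the length fits in a single byte (below 251).
function readShortString(buf, offset) {
  var len = buf[offset]
  return {
    value: buf.toString('utf8', offset + 1, offset + 1 + len),
    next: offset + 1 + len
  }
}

function parseColumnDefinition(buf) {
  var keys = ['catalog', 'schema', 'table', 'orgTable', 'name', 'orgName']
  var fields = {}
  var offset = 0
  keys.forEach(function (key) {
    var s = readShortString(buf, offset)
    fields[key] = s.value
    offset = s.next
  })
  offset += 1 // skip the 0x0c length byte
  fields.characterSet = buf.readUInt16LE(offset)
  fields.columnLength = buf.readUInt32LE(offset + 2)
  fields.columnType = buf[offset + 6]
  return fields
}

var column = parseColumnDefinition(payload)
console.log(column.name, column.characterSet) // snippet 33
```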

Edit: added a workaround at the top of the post.

Edit 2: opened a bug report against sphinx: http://sphinxsearch.com/bugs/view.php?id=2607

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 19 (12 by maintainers)

Top GitHub Comments

puzrin commented, Feb 9, 2017 (1 reaction)

IMO worth reporting at sphinx

Done http://sphinxsearch.com/bugs/view.php?id=2607. But, to be honest, they fix public reports veeery sloooow.

It’s better to find the simplest workaround. Doing .query("set character_set_results 'utf8mb4'") after each connection is not cool. An option in createPool would be fine, if it helps.
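If the SET statement does turn out to help (it is untested against sphinx here), it would not have to be repeated by hand. A minimal sketch, assuming the pool emits a 'connection' event for each newly created connection as mysql and mysql2 pools do; setResultsCharset is a hypothetical helper name.

```javascript
// Hypothetical helper: issue the SET statement once for every new
// connection the pool creates. Whether sphinx honours the statement
// is an open question in this thread.
function setResultsCharset(pool, charset) {
  pool.on('connection', function (connection) {
    connection.query("set character_set_results '" + charset + "'")
  })
}

// Usage, with the pool from the report:
//   var pool = mysql.createPool({ host: '10.0.3.77', port: 9306, connectionLimit: 10 })
//   setResultsCharset(pool, 'utf8mb4')
```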

PS. For now we use a temporary kludge: encode astrals as entities 😃
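The entity kludge could look roughly like this. It is a sketch with hypothetical helper names, not the code actually used: surrogate pairs are replaced by hexadecimal numeric entities before the text is sent, and converted back after it is read.

```javascript
// Hypothetical helpers sketching the "astrals as entities" kludge.

// Replace every surrogate pair with a numeric entity such as &#x1F639;
function encodeAstrals(str) {
  return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, function (pair) {
    return '&#x' + pair.codePointAt(0).toString(16).toUpperCase() + ';'
  })
}

// Turn entities in the astral range (U+10000 to U+10FFFF) back into characters.
function decodeAstrals(str) {
  return str.replace(/&#x([0-9A-Fa-f]{5,6});/g, function (match, hex) {
    var codePoint = parseInt(hex, 16)
    return codePoint >= 0x10000 && codePoint <= 0x10ffff
      ? String.fromCodePoint(codePoint)
      : match
  })
}

var stored = encodeAstrals('test 😹 αβγ')
console.log(stored)                // test &#x1F639; αβγ
console.log(decodeAstrals(stored)) // test 😹 αβγ
```

One caveat of this approach: text that already contains such an entity literally would also be decoded on the way back.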

sidorares commented, Feb 9, 2017 (1 reaction)

@rlidwka I guess a simple, hackish (not very future-proof) way to handle this for you might be:

var CharsetToEncoding = require('mysql2/lib/constants/charset_encodings.js');
CharsetToEncoding[33] = 'utf8'

This would force mysql2 to decode fields with encoding 33 as utf8.

Do you know if the sphinx server respects connection-time encoding flags? What are the results if you connect like this:

var mysql = require('mysql2')

var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10,
  charset: 'UTF8MB4_GENERAL_CI'
})

///...