
How to support Unicode/UTF-8 text, such as Chinese?


Here is my sample code:

from lark import Lark

# common.WORD only matches ASCII letters, so parsing "世界" fails here
parser = Lark('''start: WORD "," WORD "!"
            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''', parser='lalr')
print(parser.parse("Hello,世界!"))

I've been trying for a long time. Can anyone help with this? Thanks!

Created 5 years ago. Comments: 6 (1 by maintainers).

Top GitHub Comments

2 reactions

ray-linn commented, Aug 10, 2018

I figured out how to write the rule for Unicode in grammer.g. Here is an example:

LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
CN_ZH_LETTER: /[u"\u4e00-\u9fa5"]/
LETTER: UCASE_LETTER | LCASE_LETTER | CN_ZH_LETTER
WORD: LETTER+

and it outputs:

Tree(start, [Token(WORD, 'Hello'), Token(WORD, '世界')])

0 reactions

ruiqurm commented, Dec 4, 2021

ray-linn wrote:

I figured out how to write the rule for Unicode in grammer.g. Here is an example:

LCASE_LETTER: "a".."z"
UCASE_LETTER: "A".."Z"
CN_ZH_LETTER: /[u"\u4e00-\u9fa5"]/
LETTER: UCASE_LETTER | LCASE_LETTER | CN_ZH_LETTER
WORD: LETTER+

and it outputs:

Tree(start, [Token(WORD, 'Hello'), Token(WORD, '世界')])

Inside a regular-expression character class, the u and the quote marks are literal characters, so /[u"\u4e00-\u9fa5"]/ also matches u and ". Use /[\u4e00-\u9fa5]/ instead.
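
To see the difference between the two character classes, here is a minimal sketch using Python's standard re module directly. It assumes Lark passes the body of a /.../ terminal to Python's regex engine more or less unchanged for these characters. The names buggy, fixed, and samples are made up for illustration.

```python
import re

# Character class as written in the earlier comment:
# the u and the two quote marks are literal members of the class.
buggy = re.compile(r'[u"\u4e00-\u9fa5"]')

# Corrected class: only code points U+4E00 through U+9FA5.
fixed = re.compile(r'[\u4e00-\u9fa5]')

samples = ['世', '"', 'u', 'a']

# Collect the sample characters each pattern accepts.
buggy_hits = [ch for ch in samples if buggy.fullmatch(ch)]
fixed_hits = [ch for ch in samples if fixed.fullmatch(ch)]

print(buggy_hits)  # ['世', '"', 'u']
print(fixed_hits)  # ['世']
```

With the first pattern, a WORD could contain a stray quote mark or be extended by u for the wrong reason. The second pattern accepts only the intended range.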