Here is what I'm trying to do with regexes, and why I should use Lark

import re

from lark import Transformer, Lark

class Var:
   def __init__(self, s:str, type:str=None):
      self._str = s
      self._type = type
      
   def __str__(self):
      return self._str
   
   def type(self):
      return self._type
   
class TextString:
   def __init__(self, s:str):
      self._str = s
   
   def __str__(self):
      return self._str
   
class LaTeXString:
   def __init__(self, s:str):
      self._str = s
      
   def __str__(self):
      return self._str
   
class VarParser:
   space_regex = re.compile(r"\s\s+")
   text_regex = re.compile(r"(\\(text|textbf|operatorname){[^}]+})")
   latex_regex = re.compile(r"(\$[^$]+\$|\$\$[^$]+\$\$)")  
   var_regex = re.compile(
      r"[a-zA-Z]|\\alpha|\\beta|\\gamma|\\delta|\\epsilon|\\zeta|\\eta|"
      r"\\theta|\\iota|\\kappa|\\lambda|\\mu|\\xi|\\omicron|\\pi|\\rho|"
      r"\\sigma|\\tau|\\upsilon|\\phi|\\psi|\\chi|\\omega|"
      r"\\Alpha|\\Beta|\\Gamma|\\Delta|\\Epsilon|\\Zeta|\\Eta|\\Theta|"
      r"\\Iota|\\Kappa|\\Lambda|\\Mu|\\Xi|\\Omicron|\\Pi|\\Rho|\\Sigma|"
      r"\\Tau|\\Upsilon|\\Phi|\\Psi|\\Chi|\\Omega")

   def to_single_space(self, s):
      return " ".join(s.split())
   
   def parse_latex_parts(self, s):
      s = self.to_single_space(s)      
      match_iter = self.latex_regex.finditer(s)
      parts = []
      start = 0
      for match in match_iter:
         if match.start() > start:
            parts.append(s[start : match.start()])
         parts.append(LaTeXString(match.group()))
         start = match.end()
      if start < len(s):
         parts.append(s[start:])
      return parts
   
   def parse_text_parts(self, latex_parts):
      text_parts = 

if __name__ == '__main__':
   while True:
      parser = VarParser()
      s = input("s=")
      parts = parser.parse(s)
      print(parts)

As you can see, the code is becoming more complex than necessary, given that Lark exists.

$\text{abc} a$ => ["\text{abc} ", Var("a")] is the result I want.

Anything surrounded by $ or $$ in MathJax is considered LaTeX. Any Latin or Greek character (e.g. \alpha) I encounter inside dollar signs will be a variable. The exception is \text{…}, which may occur inside dollar signs: nothing inside the braces should be parsed as a variable, and the entire string \text{…} is left alone. The same goes for \textbf and \operatorname.

So, as you can see, using regexes has led me to the code above, which is about the simplest way to do it, and it is still not finished (it doesn't run because I didn't finish it). Because of all this complexity, I think Lark suits me best. However, I've tried several times to do this in Lark and failed for one reason or another.

I will be posting the Lark version in about 30 mins. So keep this open, please.
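
For reference, the rules described in the question (dollar-delimited LaTeX, Latin and Greek letters as variables, \text{…}, \textbf{…} and \operatorname{…} left alone) can be sketched with one ordered alternation regex. This is only a rough illustration, not the Lark solution the question asks for. The names `split_math`, `parse`, `TOKEN_RE` and `MATH_RE` are hypothetical, `Var` is a simplified stand-in for the class in the post, and treating any other backslash command (such as \frac) as verbatim text is an assumption that the post does not state.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Simplified stand-in for the post's Var class (assumption)."""
    name: str


# Greek command names copied from var_regex in the post, plus capitalized forms.
_GREEK = (
    "alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu xi "
    "omicron pi rho sigma tau upsilon phi psi chi omega"
).split()
_GREEK_ALT = "|".join(_GREEK + [g.capitalize() for g in _GREEK])

# Order matters: text blocks first, then Greek commands, then any other
# command (kept verbatim, an assumption), then single Latin letters.
TOKEN_RE = re.compile(
    r"(?P<text>\\(?:text|textbf|operatorname)\{[^}]*\})"
    r"|(?P<greek>\\(?:" + _GREEK_ALT + r")(?![a-zA-Z]))"
    r"|(?P<command>\\[a-zA-Z]+)"
    r"|(?P<latin>[a-zA-Z])"
    r"|(?P<other>[^a-zA-Z\\]+|\\.?)",
    re.DOTALL,
)

# Block math ($$...$$) is tried before inline math ($...$).
MATH_RE = re.compile(r"\$\$([^$]+)\$\$|\$([^$]+)\$")


def split_math(body):
    """Split the inside of a math span into verbatim strings and Var objects."""
    parts = []
    for m in TOKEN_RE.finditer(body):
        if m.lastgroup in ("greek", "latin"):
            parts.append(Var(m.group()))
        elif parts and isinstance(parts[-1], str):
            parts[-1] += m.group()  # merge adjacent verbatim pieces
        else:
            parts.append(m.group())
    return parts


def parse(s):
    """Return plain text outside dollars as strings, math spans as split parts."""
    parts, pos = [], 0
    for m in MATH_RE.finditer(s):
        if m.start() > pos:
            parts.append(s[pos:m.start()])
        parts.extend(split_math(m.group(1) or m.group(2)))
        pos = m.end()
    if pos < len(s):
        parts.append(s[pos:])
    return parts


print(parse(r"$\text{abc} a$"))
```

Note that this sketch drops the dollar delimiters and does not handle nested braces inside \text{…}, which is one of the cases where a real grammar becomes the better tool.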

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

MegaIng commented, Nov 22, 2019 (1 reaction)

To fix this issue, pass lexer='contextual' as an argument to Lark. (I thought that would be the default, @erezsh?)

(Btw, this currently will not correctly recognize the Greek letters.)

MegaIng commented, Nov 22, 2019 (1 reaction)

This grammar should work:

  ?start: string
  ?string: (latex_string | any_str)+
  ?latex_string: block_latex
               | inline_latex
  ?inline_latex: "$" inner_latex+ "$"
  ?block_latex: "$$" inner_latex+ "$$"
  ?inner_latex: variable 
              | integer
              | text_block
              | any_str_not_var
  ?variable: (LATIN | greek)
  ?text_block: "\\\\" TEXT_COMMAND "{" ANY_STR "}"
  integer: SIGNED_INT
  ?greek: GREEK_LOWER     -> greek
        | GREEK_UPPER     -> greek
  GREEK_LOWER: /\\\\alpha|\\\\beta|\\\\gamma|\\\\delta|\\\\epsilon|\\\\zeta|\\\\eta|\\\\theta/
              | /\\\\iota|\\\\kappa|\\\\lambda|\\\\mu|\\\\xi|\\\\omicron|\\\\pi|\\\\rho|\\\\sigma/
              | /\\\\tau|\\\\upsilon|\\\\phi|\\\\psi|\\\\chi|\\\\omega/
  GREEK_UPPER: /\\\\Alpha|\\\\Beta|\\\\Gamma|\\\\Delta|\\\\Epsilon|\\\\Zeta|\\\\Eta|\\\\Theta/
             | /\\\\Iota|\\\\Kappa|\\\\Lambda|\\\\Mu|\\\\Xi|\\\\Omicron|\\\\Pi|\\\\Rho|\\\\Sigma/
             | /\\\\Tau|\\\\Upsilon|\\\\Phi|\\\\Psi|\\\\Chi|\\\\Omega/
  TEXT_COMMAND: /text|textbf|operatorname/
  LATIN: /[a-zA-Z]/
  any_str: ANY_STR
  ANY_STR: /[^{}$]+/
  any_str_not_var: ANY_STR_NOT_VAR
  ANY_STR_NOT_VAR: /[^{}$a-zA-Z]+/
  %import common.SIGNED_INT
  %import common.WS_INLINE
  %ignore WS_INLINE

Can't test right now.
