Here is what I'm trying to do with regexes, and why I should use Lark

import re

from lark import Transformer, Lark

class Var:
   def __init__(self, s:str, type:str=None):
      self._str = s
      self._type = type
      
   def __str__(self):
      return self._str
   
   def type(self):
      return self._type
   
class TextString:
   def __init__(self, s:str):
      self._str = s
   
   def __str__(self):
      return self._str
   
class LaTeXString:
   def __init__(self, s:str):
      self._str = s
      
   def __str__(self):
      return self._str
   
class VarParser:
   space_regex = re.compile(r"\s\s+")
   text_regex = re.compile(r"(\\(text|textbf|operatorname){[^}]+})")
   latex_regex = re.compile(r"(\$[^$]+\$|\$\$[^$]+\$\$)")  
   var_regex = re.compile(
      r"[a-zA-Z]|\\alpha|\\beta|\\gamma|\\delta|\\epsilon|\\zeta|\\eta|"
      r"\\theta|\\iota|\\kappa|\\lambda|\\mu|\\xi|\\omicron|\\pi|\\rho|"
      r"\\sigma|\\tau|\\upsilon|\\phi|\\psi|\\chi|\\omega|"
      r"\\Alpha|\\Beta|\\Gamma|\\Delta|\\Epsilon|\\Zeta|\\Eta|\\Theta|"
      r"\\Iota|\\Kappa|\\Lambda|\\Mu|\\Xi|\\Omicron|\\Pi|\\Rho|\\Sigma|"
      r"\\Tau|\\Upsilon|\\Phi|\\Psi|\\Chi|\\Omega")

   def to_single_space(self, s):
      return " ".join(s.split())
   
   def parse_latex_parts(self, s):
      s = self.to_single_space(s)      
      match_iter = self.latex_regex.finditer(s)
      parts = []
      start = 0
      for match in match_iter:
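         # Keep any plain text between the previous match and this one.
         if match.start() > start:
            parts.append(s[start : match.start()])
         parts.append(LaTeXString(match.group()))
         # Resume right after the match; adding 1 here would drop a character.
         start = match.end()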
      if start < len(s):
         parts.append(s[start:])
      return parts
   
   def parse_text_parts(self, latex_parts):
      text_parts = 

if __name__ == '__main__':
   while True:
      parser = VarParser()
      s = input("s=")
      parts = parser.parse(s)  # parse() is not written yet; avoid shadowing the builtin list
      print(parts)

As you can see, the code is becoming more complex than necessary, given that Lark exists.

$\text{abc} a$ => ["\text{abc} ", Var("a")] is the result I want.

Anything surrounded by $ or $$ in MathJax is considered LaTeX. Any Latin or Greek character (e.g. \alpha) that I encounter inside dollar signs is a variable. The exception is \text{…}, which may occur inside dollar signs: nothing inside the braces should be parsed as a variable, and the entire string \text{…} is left alone. The same goes for \textbf and \operatorname.

So as you can see, using regexes has led me to the code above, which is about as simple as this approach gets, and it is still not finished (it doesn't run because I didn't finish it). Because of all this complexity, I think Lark suits me best. However, I've tried several times to do this in Lark and failed for one reason or another.

I will be posting the Lark version in about 30 mins. So keep this open, please.
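
For anyone picking this up, the target behaviour described above can be pinned down with a short, stdlib-only Python sketch. This is not the Lark solution, only an executable statement of the rules that a grammar could be checked against. The names `Var` (a dataclass here), `parse_math` and `parse` are hypothetical. Dropping the `$` delimiters from the output and leaving other commands such as \frac as plain text are assumptions read off the single example, not something the post spells out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Stand-in for the post's Var class (hypothetical, simplified)."""
    name: str


# Same Greek names as var_regex in the post, lower case plus capitalised.
GREEK = ("alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu "
         "xi omicron pi rho sigma tau upsilon phi psi chi omega").split()
GREEK += [name.capitalize() for name in GREEK]

# $$...$$ is tried before $...$ so block math is not split in two.
MATH_RE = re.compile(r"\$\$([^$]+)\$\$|\$([^$]+)\$")

# Alternatives are ordered: protected text blocks, Greek commands,
# any other command (assumption: kept as plain text), single Latin letters.
TOKEN_RE = re.compile(
    r"(?P<text>\\(?:textbf|text|operatorname)\{[^}]*\})"
    r"|(?P<greek>\\(?:" + "|".join(GREEK) + r")(?![a-zA-Z]))"
    r"|(?P<cmd>\\[a-zA-Z]+)"
    r"|(?P<latin>[a-zA-Z])"
)


def parse_math(body):
    """Split the inside of a $...$ region into plain strings and Var objects."""
    parts = []
    pos = 0

    def add_text(text):
        # Merge adjacent plain text so "\text{abc}" and " " become one string.
        if not text:
            return
        if parts and isinstance(parts[-1], str):
            parts[-1] += text
        else:
            parts.append(text)

    for m in TOKEN_RE.finditer(body):
        add_text(body[pos:m.start()])
        if m.lastgroup in ("greek", "latin"):
            parts.append(Var(m.group()))
        else:
            add_text(m.group())
        pos = m.end()
    add_text(body[pos:])
    return parts


def parse(s):
    """Parse a whole string; text outside dollar signs is kept untouched."""
    parts = []
    pos = 0
    for m in MATH_RE.finditer(s):
        if m.start() > pos:
            parts.append(s[pos:m.start()])
        # Assumption: the $ / $$ delimiters themselves are dropped.
        parts.extend(parse_math(m.group(1) or m.group(2)))
        pos = m.end()
    if pos < len(s):
        parts.append(s[pos:])
    return parts
```

With this sketch, `parse(r"$\text{abc} a$")` gives `[r"\text{abc} ", Var("a")]`, which is the result asked for above.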

Issue Analytics

State:
Created 4 years ago
Comments:13 (6 by maintainers)

Top GitHub Comments

MegaIng commented, Nov 22, 2019

To fix this issue, pass lexer='contextual' as an argument to Lark. (I thought that would be the default, @erezsh?)

(Btw, this currently will not correctly recognize the Greek letters.)

MegaIng commented, Nov 22, 2019

This grammar should work:

  ?start: string
  ?string: (latex_string | any_str)+
  ?latex_string: block_latex
               | inline_latex
  ?inline_latex: "$" inner_latex+ "$"
  ?block_latex: "$$" inner_latex+ "$$"
  ?inner_latex: variable 
              | integer
              | text_block
              | any_str_not_var
  ?variable: (LATIN | greek)
  ?text_block: "\\\\" TEXT_COMMAND "{" ANY_STR "}"
  integer: SIGNED_INT
  ?greek: GREEK_LOWER     -> greek
        | GREEK_UPPER     -> greek
  GREEK_LOWER: /\\\\alpha|\\\\beta|\\\\gamma|\\\\delta|\\\\epsilon|\\\\zeta|\\\\eta|\\\\theta/
              | /\\\\iota|\\\\kappa|\\\\lambda|\\\\mu|\\\\xi|\\\\omicron|\\\\pi|\\\\rho|\\\\sigma/
              | /\\\\tau|\\\\upsilon|\\\\phi|\\\\psi|\\\\chi|\\\\omega/
  GREEK_UPPER: /\\\\Alpha|\\\\Beta|\\\\Gamma|\\\\Delta|\\\\Epsilon|\\\\Zeta|\\\\Eta|\\\\Theta/
             | /\\\\Iota|\\\\Kappa|\\\\Lambda|\\\\Mu|\\\\Xi|\\\\Omicron|\\\\Pi|\\\\Rho|\\\\Sigma/
             | /\\\\Tau|\\\\Upsilon|\\\\Phi|\\\\Psi|\\\\Chi|\\\\Omega/
  TEXT_COMMAND: /textbf|text|operatorname/
  LATIN: /[a-zA-Z]/
  any_str: ANY_STR
  ANY_STR: /[^{}$]+/
  any_str_not_var: ANY_STR_NOT_VAR
  ANY_STR_NOT_VAR: /[^{}$a-zA-Z]+/
  %import common.SIGNED_INT
  %import common.WS_INLINE
  %ignore WS_INLINE

Can't test right now.