Here is what I'm trying to do with regexes, and why I should use Lark
import re
from lark import Transformer, Lark


class Var:
    def __init__(self, s: str, type: str = None):
        self._str = s
        self._type = type

    def __str__(self):
        return self._str

    def type(self):
        return self._type


class TextString:
    def __init__(self, s: str):
        self._str = s

    def __str__(self):
        return self._str


class LaTeXString:
    def __init__(self, s: str):
        self._str = s

    def __str__(self):
        return self._str


class VarParser:
    space_regex = re.compile(r"\s\s+")
    # \text{...}, \textbf{...} and \operatorname{...} must be left alone
    text_regex = re.compile(r"(\\(text|textbf|operatorname){[^}]+})")
    # inline ($...$) and display ($$...$$) math
    latex_regex = re.compile(r"(\$[^$]+\$|\$\$[^$]+\$\$)")
    # a single Latin letter or a Greek letter command
    var_regex = re.compile(
        r"[a-zA-Z]|\\alpha|\\beta|\\gamma|\\delta|\\epsilon|\\zeta|\\eta|"
        r"\\theta|\\iota|\\kappa|\\lambda|\\mu|\\xi|\\omicron|\\pi|\\rho|"
        r"\\sigma|\\tau|\\upsilon|\\phi|\\psi|\\chi|\\omega|"
        r"\\Alpha|\\Beta|\\Gamma|\\Delta|\\Epsilon|\\Zeta|\\Eta|\\Theta|"
        r"\\Iota|\\Kappa|\\Lambda|\\Mu|\\Xi|\\Omicron|\\Pi|\\Rho|\\Sigma|"
        r"\\Tau|\\Upsilon|\\Phi|\\Psi|\\Chi|\\Omega")

    def to_single_space(self, s):
        return " ".join(s.split())

    def parse_latex_parts(self, s):
        s = self.to_single_space(s)
        match_iter = self.latex_regex.finditer(s)
        parts = []
        start = 0
        for match in match_iter:
            if match.span()[0] > 0:
                parts.append(s[start : match.span()[0]])
            parts.append(LaTeXString(match.group()))
            start = match.span()[1] + 1
        if start < len(s):
            parts.append(s[start:])
        return parts

    def parse_text_parts(self, latex_parts):
        text_parts =


if __name__ == '__main__':
    while True:
        parser = VarParser()
        s = input("s=")
        list = parser.parse(s)
        print(list)
As you can see, the code is becoming more complex than necessary, given that Lark exists.
$\text{abc} a$ => ["\text{abc} ", Var("a")] is the result I want.
Anything surrounded by $ or $$ in MathJax is considered LaTeX. Any Latin or Greek character (e.g. \alpha) I encounter inside dollar signs will be a variable. The exception is \text{…}, which might occur inside dollar signs: anything inside the … should not be parsed as a variable, and the entire string \text{…} is left alone. The same goes for \textbf and \operatorname.
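To make these rules concrete, here is a minimal sketch of them applied to the inside of a single math span, using one alternation regex from the standard library. The names `split_math`, `TOKEN` and the simplified `Var` dataclass are made up for this illustration and are not part of the code above. It assumes the dollar signs have already been stripped, and it cannot handle nested braces inside \text{…}, which is one of the places where a real grammar does better.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Simplified stand-in for the Var class in the issue (illustration only)."""
    name: str


# Same Greek letter names as in var_regex above, lower case plus capitalized.
LOWER = ("alpha|beta|gamma|delta|epsilon|zeta|eta|theta|iota|kappa|lambda|mu|"
         "xi|omicron|pi|rho|sigma|tau|upsilon|phi|psi|chi|omega")
GREEK = LOWER + "|" + "|".join(name.capitalize() for name in LOWER.split("|"))

TOKEN = re.compile(
    # \text{...}, \textbf{...}, \operatorname{...}: kept verbatim
    r"(?P<text>\\(?:text|textbf|operatorname)\{[^}]*\})"
    # a Greek letter command or a single Latin letter: a variable
    r"|(?P<var>\\(?:" + GREEK + r")(?![a-zA-Z])|[a-zA-Z])"
    # any other command, escaped character, or run of non-letters
    r"|(?P<other>\\[a-zA-Z]+|\\.|[^a-zA-Z\\]+)",
    re.DOTALL,
)


def split_math(body):
    """Split the inside of a $...$ span into plain strings and Var objects."""
    parts = []
    for m in TOKEN.finditer(body):
        if m.lastgroup == "var":
            parts.append(Var(m.group()))
        elif parts and isinstance(parts[-1], str):
            parts[-1] += m.group()  # merge adjacent non-variable text
        else:
            parts.append(m.group())
    return parts
```

Calling `split_math(r"\text{abc} a")` gives the string `\text{abc} ` followed by a `Var` for `a`, which matches the result asked for above.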
So, as you can see, using regexes has led me to the code above, which is in fact the simplest way you could do it, and the code is still not finished (it doesn’t run because I didn’t finish it). Because of all this complexity, I think Lark suits me best. However, I’ve tried several times to do this in Lark and failed for one reason or another.
I will be posting the Lark version in about 30 mins. So keep this open, please.
Top GitHub Comments
To fix this issue, pass lexer='contextual' as an argument to Lark. (I thought that would be the default, @erezsh?) (Btw, this currently will not correctly recognize the Greek letters.)
This grammar should work:
Can't test right now.