Escaping in rx strings seems inconsistent

I am trying to understand the syntax of rx expressions, and either I have gotten confused or they really are inconsistent about escaping. The question boils down to this: are the expressions supposed to be “raw strings” (passed as-is to the regex engine), or strings that must be unquoted first?

The best example is probably this one, since it demonstrates both in a single rule:

https://github.com/coreruleset/coreruleset/blob/v4.0/dev/rules/REQUEST-934-APPLICATION-ATTACK-GENERIC.conf#L241

"@rx (?i)((?:s(?:sh(?:2(?:.(?:s(?:(?:ft|c)p|hell)|tunnel|exec))?)?|m(?:[bs]|tps?)|vn(?:+ssh)?|n(?:ews|mp)|ips?|ftp|3)|p(?:op(?:3s?|2)|r(?:oxy|es)|h(?:ar|p)|aparazzi|syc)|c(?:ompress.(?:bzip2|zlib)|a(?:llto|p)|id|vs)|t(?:e(?:amspeak|lnet)|urns?|ftp)|f(?:i(?:nger|sh)|(?:ee)?d|tps?)|i(?:rc[6s]?|maps?|pps?|cap|ax)|d(?:a(?:ta|v)|n(?:tp|s)|ict)|m(?:a(?:ilto|ven)|umble|ms)|n(?:e(?:tdoc|ws)|ntps?|fs)|r(?:tm(?:f?p)?|sync|ar|mi)|v(?:iew-source|entrilo|nc)|a(?:ttachment|f[ps]|cap)|b(?:eshare|itcoin|lob)|g(?:o(?:pher)?|lob|it)|u(?:nreal|t2004|dp)|e(?:xpect|d2k)|h(?:ttps?|323)|w(?:ebcal|s?s)|ja(?:bbe)?r|x(?:mpp|ri)|ldap[is]?|ogg|zip)://(?:(?:[\d.]{0,11}(?:(?:\xe2(?:\x92(?:[\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5]|[\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b]|[\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf]|[\x80\x81\x82\x83\x84\x85\x86\x87])|\x93(?:[\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f]|[\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9]|[\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b]|[\xbf\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe]|[\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4])|\x91(?:[\xaa\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3]|[\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf]))|\xe3\x80\x82))+)|[a-z][\w-.]{1,255}:\d{1,5}(?:#?\s*&?@(?:(?:\d{1,3}.){3,3}\d{1,3}|[a-z][\w-.]{1,255}):\d{1,5}/?)+|(?:0x[a-f0-9]{2}.){3}0x[a-f0-9]{2}|(?:0{1,4}\d{1,3}.){3}0{1,4}\d{1,3}|\d{1,3}.(?:\d{1,3}.\d{5}|\d{8})|0x(?:[a-f0-9]{16}|[a-f0-9]{8})|[[a-f\d:]+(?:[\d.]+|%\w+)?]|(?:\x5c\x5c[a-z\d-].?_?)+|\d{10}))" \

We see many raw regex escapes in the string, such as \d and \s. But we also see byte sequences like \x9c and \x92. Correct me if I’m wrong, but string syntax in most languages cannot express what I suppose is the intent: the \x sequences are to be unquoted, while everything else is read as-is. Coraza currently passes the ASCII characters \, x, 9, c to the regex engine unchanged, and because of this it does not seem to match Unicode correctly.
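To make the mismatch concrete, here is a minimal Go sketch using only the standard regexp package. The helper name matches is made up for this example, and the code is not taken from Coraza. It compares a pattern that reaches the engine as the literal text \xe2\x92\x9c with one whose escapes were already decoded to bytes.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern, compiled with Go's regexp package,
// matches input. The name is hypothetical, chosen for this sketch.
func matches(pattern, input string) bool {
	return regexp.MustCompile(pattern).MatchString(input)
}

func main() {
	// The three bytes E2 92 9C are the UTF-8 encoding of U+249C.
	input := "\xe2\x92\x9c"

	// Raw: the engine receives the ASCII text \xe2\x92\x9c and reads each
	// escape as a code point (U+00E2, U+0092, U+009C), not as a byte.
	fmt.Println(matches(`\xe2\x92\x9c`, input)) // false

	// Unquoted: the Go string literal already holds the bytes E2 92 9C,
	// so the pattern is the single character U+249C.
	fmt.Println(matches("\xe2\x92\x9c", input)) // true
}
```

Go's regexp works on code points, which is why the raw form fails here. An engine that matches byte-wise may treat the same raw text differently.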

Note that unquoting this particular regex will probably still work fine because it’s not ambiguous, but if it contained \\, it would be: unquoting would likely break the regex.

I have a feeling it’s supposed to be a mix of raw and non-raw strings, where everything is raw except for \x (and presumably \u) sequences. It won’t be easy to write this in Go, but I guess it’s doable 😃 I would like to confirm that this is the actual format. Sorry if it’s already documented, but I wasn’t able to find it.
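A minimal Go sketch of that mixed reading might look like this. The function unquoteHexOnly is hypothetical and written for this example. It is not the Coraza implementation, and it assumes that only two-digit \xHH escapes are decoded while every other backslash sequence stays raw.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unquoteHexOnly is a hypothetical helper: it turns each \xHH escape into
// the byte it names and copies everything else, including all other
// backslash sequences, unchanged.
func unquoteHexOnly(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] != '\\' || i+1 >= len(s) {
			b.WriteByte(s[i])
			continue
		}
		if s[i+1] == 'x' && i+3 < len(s) {
			if v, err := strconv.ParseUint(s[i+2:i+4], 16, 8); err == nil {
				b.WriteByte(byte(v))
				i += 3 // skip the rest of \xHH
				continue
			}
		}
		// Any other escape (\d, \s, \\, ...) stays raw. Both characters are
		// copied together so an escaped backslash cannot start a \x escape.
		b.WriteByte(s[i])
		b.WriteByte(s[i+1])
		i++
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", unquoteHexOnly(`\xe2\x92\x9c\d{1,3}`))
	fmt.Printf("%q\n", unquoteHexOnly(`\\x41`))
}
```

A lone byte such as those in the rule's character classes is not valid UTF-8 on its own, so the decoded pattern may need further handling before Go's regexp package will accept it.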

Top GitHub Comments

anuraaga commented, Sep 15, 2022

BTW for context, here is how to unquote in Go. We can’t use the normal Unquote function since it expects only valid escape characters.
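The limitation of the standard function can be shown with a short Go sketch. The helper tryUnquote is a made-up name for this example and is not the code from the linked pull request.

```go
package main

import (
	"fmt"
	"strconv"
)

// tryUnquote wraps s in double quotes and hands it to strconv.Unquote.
// Hypothetical helper; it assumes s contains no double quote characters.
func tryUnquote(s string) (string, error) {
	return strconv.Unquote(`"` + s + `"`)
}

func main() {
	// \xHH is a valid Go string escape, so this succeeds.
	_, err := tryUnquote(`\xe2\x92\x9c`)
	fmt.Println(err) // <nil>

	// \d is a regex escape but not a Go string escape, so Unquote rejects it.
	_, err = tryUnquote(`\d{1,3}`)
	fmt.Println(err) // invalid syntax
}
```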

https://github.com/corazawaf/coraza/pull/425/files#diff-79e29a941e684abd3bbab04c673bbebe97a01d4192839779eb93b7ce2b6ee4fdR89

anuraaga commented, Sep 15, 2022

I’ve mostly confirmed that rx strings are indeed quoted strings, with many quoted backslashes kept as raw backslashes when unambiguous to improve readability of the rules. I’ll go ahead and close this issue, thanks.
