Escaping in rx strings seems inconsistent
I am trying to understand the syntax of rx expressions, and I have either gotten confused or found a real inconsistency in how they are escaped. The question boils down to this: are the expressions supposed to be “raw strings” (passed as-is to the regex engine), or quoted strings that must be unquoted first?
The best example is probably this one, since it demonstrates both kinds of escape in a single rule:
"@rx (?i)((?:s(?:sh(?:2(?:.(?:s(?:(?:ft|c)p|hell)|tunnel|exec))?)?|m(?:[bs]|tps?)|vn(?:+ssh)?|n(?:ews|mp)|ips?|ftp|3)|p(?:op(?:3s?|2)|r(?:oxy|es)|h(?:ar|p)|aparazzi|syc)|c(?:ompress.(?:bzip2|zlib)|a(?:llto|p)|id|vs)|t(?:e(?:amspeak|lnet)|urns?|ftp)|f(?:i(?:nger|sh)|(?:ee)?d|tps?)|i(?:rc[6s]?|maps?|pps?|cap|ax)|d(?:a(?:ta|v)|n(?:tp|s)|ict)|m(?:a(?:ilto|ven)|umble|ms)|n(?:e(?:tdoc|ws)|ntps?|fs)|r(?:tm(?:f?p)?|sync|ar|mi)|v(?:iew-source|entrilo|nc)|a(?:ttachment|f[ps]|cap)|b(?:eshare|itcoin|lob)|g(?:o(?:pher)?|lob|it)|u(?:nreal|t2004|dp)|e(?:xpect|d2k)|h(?:ttps?|323)|w(?:ebcal|s?s)|ja(?:bbe)?r|x(?:mpp|ri)|ldap[is]?|ogg|zip)://(?:(?:[\d.]{0,11}(?:(?:\xe2(?:\x92(?:[\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5]|[\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b]|[\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf]|[\x80\x81\x82\x83\x84\x85\x86\x87])|\x93(?:[\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f]|[\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9]|[\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b]|[\xbf\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe]|[\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4])|\x91(?:[\xaa\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3]|[\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf]))|\xe3\x80\x82))+)|[a-z][\w-.]{1,255}:\d{1,5}(?:#?\s*&?@(?:(?:\d{1,3}.){3,3}\d{1,3}|[a-z][\w-.]{1,255}):\d{1,5}/?)+|(?:0x[a-f0-9]{2}.){3}0x[a-f0-9]{2}|(?:0{1,4}\d{1,3}.){3}0{1,4}\d{1,3}|\d{1,3}.(?:\d{1,3}.\d{5}|\d{8})|0x(?:[a-f0-9]{16}|[a-f0-9]{8})|[[a-f\d:]+(?:[\d.]+|%\w+)?]|(?:\x5c\x5c[a-z\d-].?_?)+|\d{10}))" \
The string contains many raw regex escapes such as \d and \s, but it also contains byte sequences like \x9c and \x92. Correct me if I’m wrong, but strings in most languages can’t model what I suppose is the intent: the \x expressions are to be unquoted into bytes, while everything else is read as-is. Coraza currently passes the ASCII characters \, x, 9, c to the regex engine unchanged, and because of this it does not seem to match Unicode input correctly.
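To make the Unicode problem concrete, here is a minimal sketch using Go's standard `regexp` package rather than Coraza itself. The helper name `matchesIdeographicStop` is made up for illustration. It compares the `\xe3\x80\x82` sequence from the rule (the UTF-8 encoding of "。", U+3002) passed as raw text against the same sequence unquoted into bytes first.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesIdeographicStop is a hypothetical helper: it reports whether pattern
// matches "。" (U+3002), whose UTF-8 encoding is the bytes e3 80 82.
func matchesIdeographicStop(pattern string) bool {
	return regexp.MustCompile(pattern).MatchString("。")
}

func main() {
	// Raw string: the regex engine receives the characters \, x, e, 3, ...
	// and reads each \xNN as a code point (U+00E3, U+0080, U+0082).
	raw := `\xe3\x80\x82`

	// Interpreted string: the Go compiler turns the escapes into the three
	// bytes e3 80 82, so the engine receives the character "。" itself.
	unquoted := "\xe3\x80\x82"

	fmt.Println(matchesIdeographicStop(raw))      // false
	fmt.Println(matchesIdeographicStop(unquoted)) // true
}
```

The raw form compiles without error, which is why the mismatch is easy to miss: the pattern is valid, it just describes three different characters.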
Note that unquoting this particular regex will probably still work, because it is not ambiguous. But if it contained \\, unquoting would be ambiguous and would likely break the regex.
I have a feeling it’s supposed to be a mix of raw and non-raw strings, where everything is raw except for \x (and presumably \u) sequences. That won’t be easy to write in Go, but I guess it’s doable 😃 I would like to confirm that this is the actual format. Sorry if it’s already documented; I wasn’t able to find it.
Top GitHub Comments
BTW for context, here is how to unquote in Go. We can’t use the normal Unquote function since it accepts only valid escape sequences. https://github.com/corazawaf/coraza/pull/425/files#diff-79e29a941e684abd3bbab04c673bbebe97a01d4192839779eb93b7ce2b6ee4fdR89
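As a quick illustration of that limitation, the sketch below assumes the "normal Unquote function" means `strconv.Unquote` from the standard library. The wrapper `tryUnquote` is a hypothetical name. It accepts a string made only of `\x` escapes but rejects one containing a regex escape such as `\d`.

```go
package main

import (
	"fmt"
	"strconv"
)

// tryUnquote is a hypothetical wrapper: it surrounds s with double quotes and
// hands it to strconv.Unquote, which applies Go string-literal rules.
func tryUnquote(s string) (string, error) {
	return strconv.Unquote(`"` + s + `"`)
}

func main() {
	// \d is not a Go escape sequence, so the whole string is rejected.
	_, err := tryUnquote(`\d{1,3}`)
	fmt.Println(err) // invalid syntax

	// A string made only of valid escapes is decoded as expected.
	out, err := tryUnquote(`\x41\x42`)
	fmt.Println(out, err) // AB <nil>
}
```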
I’ve mostly confirmed that rx strings are indeed quoted strings, with many quoted backslashes kept as raw backslashes when unambiguous to improve readability of the rules. Will go ahead and close this issue, thanks