Escaping in rx strings seems inconsistent

I am trying to understand the syntax of rx expressions, and either I have gotten confused or they really are inconsistent about escaping. The question boils down to this: are the expressions supposed to be “raw strings” (passed as-is to the regex engine), or are they quoted strings that must be unquoted first?

The best example is probably this one, since it demonstrates the issue in a single rule:

https://github.com/coreruleset/coreruleset/blob/v4.0/dev/rules/REQUEST-934-APPLICATION-ATTACK-GENERIC.conf#L241

@rx (?i)((?:s(?:sh(?:2(?:.(?:s(?:(?:ft|c)p|hell)|tunnel|exec))?)?|m(?:[bs]|tps?)|vn(?:+ssh)?|n(?:ews|mp)|ips?|ftp|3)|p(?:op(?:3s?|2)|r(?:oxy|es)|h(?:ar|p)|aparazzi|syc)|c(?:ompress.(?:bzip2|zlib)|a(?:llto|p)|id|vs)|t(?:e(?:amspeak|lnet)|urns?|ftp)|f(?:i(?:nger|sh)|(?:ee)?d|tps?)|i(?:rc[6s]?|maps?|pps?|cap|ax)|d(?:a(?:ta|v)|n(?:tp|s)|ict)|m(?:a(?:ilto|ven)|umble|ms)|n(?:e(?:tdoc|ws)|ntps?|fs)|r(?:tm(?:f?p)?|sync|ar|mi)|v(?:iew-source|entrilo|nc)|a(?:ttachment|f[ps]|cap)|b(?:eshare|itcoin|lob)|g(?:o(?:pher)?|lob|it)|u(?:nreal|t2004|dp)|e(?:xpect|d2k)|h(?:ttps?|323)|w(?:ebcal|s?s)|ja(?:bbe)?r|x(?:mpp|ri)|ldap[is]?|ogg|zip)://(?:(?:[\d.]{0,11}(?:(?:\xe2(?:\x92(?:[\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5]|[\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b]|[\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf]|[\x80\x81\x82\x83\x84\x85\x86\x87])|\x93(?:[\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f]|[\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9]|[\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b]|[\xbf\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe]|[\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4])|\x91(?:[\xaa\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3]|[\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf]))|\xe3\x80\x82))+)|[a-z][\w-.]{1,255}:\d{1,5}(?:#?\s*&?@(?:(?:\d{1,3}.){3,3}\d{1,3}|[a-z][\w-.]{1,255}):\d{1,5}/?)+|(?:0x[a-f0-9]{2}.){3}0x[a-f0-9]{2}|(?:0{1,4}\d{1,3}.){3}0{1,4}\d{1,3}|\d{1,3}.(?:\d{1,3}.\d{5}|\d{8})|0x(?:[a-f0-9]{16}|[a-f0-9]{8})|[[a-f\d:]+(?:[\d.]+|%\w+)?]|(?:\x5c\x5c[a-z\d-].?_?)+|\d{10}))" \

The string contains many raw regex escapes such as \d and \s, but it also contains byte sequences like \x9c and \x92. Correct me if I’m wrong, but string literals in most languages can’t model what I suppose is the intent here: the \x sequences are unquoted into bytes while everything else is read as-is. In Coraza, we currently pass the ASCII characters \, x, 9, c through to the regex engine unchanged, and as a result Coraza does not seem to match Unicode input correctly.
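To make the difference concrete, here is a minimal sketch assuming Go's standard `regexp` package (the helper names `matchesRaw` and `matchesUnquoted` are made up for illustration). As far as I understand it, that package reads `\xHH` in a pattern as the code point U+00HH rather than as a byte, so the raw form of a three-byte UTF-8 sequence does not match the character those bytes encode.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesRaw compiles the pattern exactly as written in the rule file, so the
// regex engine itself interprets each \xHH escape.
func matchesRaw(input string) bool {
	// Raw string literal: the pattern holds the characters \, x, e, 2, ...
	re := regexp.MustCompile(`\xe2\x92\x9c`)
	return re.MatchString(input)
}

// matchesUnquoted compiles the pattern after the \xHH escapes have been
// turned into bytes, as a quoted-string reading of the rule would do.
func matchesUnquoted(input string) bool {
	// Interpreted string literal: the pattern is the three bytes e2 92 9c.
	re := regexp.MustCompile("\xe2\x92\x9c")
	return re.MatchString(input)
}

func main() {
	input := "⒜" // U+249C, encoded in UTF-8 as the bytes e2 92 9c
	fmt.Println("raw pattern matches:     ", matchesRaw(input))
	fmt.Println("unquoted pattern matches:", matchesUnquoted(input))
}
```

The raw pattern instead matches the three separate code points U+00E2, U+0092 and U+009C, which is almost certainly not what the rule intends.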

Note that unquoting this particular regex will probably still work fine because it’s not ambiguous, but if it contained \\ it would be: unquoting would likely break the regex.

I have a feeling it’s supposed to be a mix of raw and non-raw strings, where everything is raw except for \x (and presumably \u) sequences. That won’t be easy to write in Go, but I guess it’s doable 😃 I’d like to confirm this is the actual format. Sorry if it’s already documented; I wasn’t able to find it.
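A rough sketch of what that mixed reading could look like in Go. This only illustrates the hypothesis above; it is not how Coraza or CRS actually does it. `unquoteHex` and `hexVal` are hypothetical helpers, and \u sequences are deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// hexVal returns the value of a hex digit and whether c is one.
func hexVal(c byte) (byte, bool) {
	switch {
	case c >= '0' && c <= '9':
		return c - '0', true
	case c >= 'a' && c <= 'f':
		return c - 'a' + 10, true
	case c >= 'A' && c <= 'F':
		return c - 'A' + 10, true
	}
	return 0, false
}

// unquoteHex (hypothetical helper) replaces every \xHH with the byte it names
// and copies everything else unchanged, including \d, \s and \\.
func unquoteHex(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] != '\\' || i+1 >= len(s) {
			b.WriteByte(s[i])
			continue
		}
		// An escaped backslash is copied as a pair so that `\\x41` is not
		// mistaken for a hex escape.
		if s[i+1] == '\\' {
			b.WriteString(`\\`)
			i++
			continue
		}
		if s[i+1] == 'x' && i+3 < len(s) {
			hi, ok1 := hexVal(s[i+2])
			lo, ok2 := hexVal(s[i+3])
			if ok1 && ok2 {
				b.WriteByte(hi<<4 | lo)
				i += 3
				continue
			}
		}
		// Any other escape is left for the regex engine to interpret.
		b.WriteByte(s[i])
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", unquoteHex(`\d+\xe2\x92\x9c\s`))
	fmt.Printf("%q\n", unquoteHex(`\\x41`))
}
```

Copying `\\` as a pair is the piece that deals with the ambiguity mentioned earlier: without it, the `x41` after an escaped backslash would be wrongly decoded as a byte.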

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

anuraaga commented, Sep 15, 2022

BTW, for context, here is how to unquote in Go. We can’t use the standard strconv.Unquote function, since it accepts only valid Go escape sequences.

https://github.com/corazawaf/coraza/pull/425/files#diff-79e29a941e684abd3bbab04c673bbebe97a01d4192839779eb93b7ce2b6ee4fdR89
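A small illustration of that limitation (this is not the code from the linked PR; `tryUnquote` is a made-up helper): `strconv.Unquote` applies Go's string-literal rules, so it rejects a pattern as soon as it contains a regex-only escape such as \d.

```go
package main

import (
	"fmt"
	"strconv"
)

// tryUnquote wraps the pattern in double quotes and hands it to
// strconv.Unquote, which applies Go's string-literal escape rules.
func tryUnquote(pattern string) (string, error) {
	return strconv.Unquote(`"` + pattern + `"`)
}

func main() {
	// \x41 is a valid Go escape, so this succeeds.
	out, err := tryUnquote(`\x41`)
	fmt.Printf("%q %v\n", out, err)

	// \d is a regex escape but not a Go escape, so this returns an error.
	out, err = tryUnquote(`\d+\x41`)
	fmt.Printf("%q %v\n", out, err)
}
```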

anuraaga commented, Sep 15, 2022

I’ve mostly confirmed that rx strings are indeed quoted strings, with many quoted backslashes kept as raw backslashes, where unambiguous, to improve the readability of the rules. I’ll go ahead and close this issue, thanks.
