Implement PCRE style in-regex comments; e.g. (?#comment)
The Python regex implementation we use does not appear to implement any method of having in-regex-text comments which would work in the watchlist and blacklists.[1] It would be beneficial for us to be able to include comments in at least our watchlist and blacklist entries, and potentially in the other regexes that we use in findspam.py. PCRE implements in-regex comments using the syntax (?#comment).
It would be relatively easy for us to implement support for PCRE style regex comments. These could be implemented by just removing, from the strings we convert to regexes, any content which matches this regex:[2]
\(\?#(?<!(?:[^\\]|^)(?:\\\\)*\\\(\?#)[^)]*\)
This substitution could be performed at one of the following points (listed in order of increasing generality):
- For watchlist and blacklists only: when we read the watchlist and blacklist lines from the files,
- All 'regex' detections: just prior to using regex.compile() on the text provided in all the 'regex' detections, or
- All regexes: as a wrapper to regex.compile().
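As a rough illustration of the third option, here is a minimal sketch of a compile wrapper that strips comments first. It deliberately does not use the look-behind regex above; instead it scans escaped characters and comments in a single alternation, so an escaped \( is consumed before it can start a comment. The names strip_regex_comments and compile_with_comments are hypothetical, the standard re module stands in for regex to keep the example self-contained, and (like the regex above) it does not special-case (?# appearing inside a character class.

```python
import re

# Either an escaped character (captured, so it can be kept)
# or a (?#...) comment (not captured, so it is dropped).
_TOKEN = re.compile(r"(\\.)|\(\?#[^)]*\)", re.DOTALL)


def strip_regex_comments(pattern: str) -> str:
    """Remove (?#comment) groups, leaving escaped sequences untouched."""
    # group(1) is None when the comment branch matched, so it becomes "".
    return _TOKEN.sub(lambda m: m.group(1) or "", pattern)


def compile_with_comments(pattern: str, flags: int = 0):
    """Hypothetical wrapper: strip comments, then compile as usual."""
    return re.compile(strip_regex_comments(pattern), flags)


if __name__ == "__main__":
    print(strip_regex_comments(r"foo(?#a note)bar"))    # comment removed
    print(strip_regex_comments(r"foo\(?#a note)bar"))   # escaped paren, kept as-is
    print(strip_regex_comments(r"foo\\(?#a note)bar"))  # escaped backslash, comment removed
```

Because each escaped pair is consumed as a unit, the number of backslashes before (?# does not need to be counted explicitly.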
1. There is the possibility of “Verbose” regexes using the X flag, which I assume is also available in the regex module we’re using. However, using these would not address having comments in the watchlist and blacklists.
2. That regex is untested, as it relies on variable length look-behinds for which I don’t have a simulator/tester. The regex \(\?#(?<!^\\\(\?#)(?<![^\\]\\\(\?#)(?<!\\\\\\\(\?#)[^)]*\) is tested and correctly matches, or not, for up to 3 \ escapes prior to the (?#comment).
Issue Analytics
- Created 5 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
Prior to writing this RFE, I had found non-official documentation which said these are not implemented. In addition, the official documentation I looked in didn’t mention them, while mentioning other types of comments (“verbose” regular expressions).
However, having looked in the source code for regex, it does appear that this style of comment is already implemented as a standard part of both the re and regex implementations. After finding it in the source, I also found it in the re documentation.
So, there’s no need for this RFE, as it’s already natively supported. So, sorry to waste everyone’s time.
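For anyone wanting to confirm the native support, a quick check along these lines should do. It uses the standard re module only; the helper name matches and the sample pattern and text are made up for illustration, not taken from the watchlist.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


# The (?#...) group is ignored by the engine, so this behaves like badsite\.example
print(matches(r"bad(?#illustrative note)site\.example", "visit badsite.example"))  # True

# An escaped \(\?# is still a literal, not the start of a comment
print(matches(r"\(\?#x\)", "(?#x)"))  # True
```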
@quartata Assuming that python-pcre does implement just PCRE, it’s unlikely that it would be easier to use it. We currently use capabilities (e.g. variable length look-behind) which are not part of PCRE. If it was me, I’d much rather implement a single regex-replace (which is all that implementing this requires) than take on the known and unknown issues of moving to a different regex implementation.