Add support for "regex" library
Code Sample, a copy-pastable example if possible
import re
import pandas as pd
import regex
df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "1", "2"]})
pattern = r"\d"
df.b.str.match(pattern)
df.b.str.match(re.compile(pattern))
df.b.str.match(regex.compile(pattern))  # raises TypeError
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-65-eec2b9ae9613> in <module>()
9 df.b.str.match(pattern)
10 df.b.str.match(re.compile(pattern))
---> 11 df.b.str.match(regex.compile(pattern))
~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in match(self, pat, case, flags, na, as_indexer)
2421 def match(self, pat, case=True, flags=0, na=np.nan, as_indexer=None):
2422 result = str_match(self._data, pat, case=case, flags=flags, na=na,
-> 2423 as_indexer=as_indexer)
2424 return self._wrap_result(result)
2425
~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in str_match(arr, pat, case, flags, na, as_indexer)
736 flags |= re.IGNORECASE
737
--> 738 regex = re.compile(pat, flags=flags)
739
740 if (as_indexer is False) and (regex.groups > 0):
~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
231 def compile(pattern, flags=0):
232 "Compile a regular expression pattern, returning a pattern object."
--> 233 return _compile(pattern, flags)
234
235 def purge():
~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
298 return pattern
299 if not sre_compile.isstring(pattern):
--> 300 raise TypeError("first argument must be string or compiled pattern")
301 p = sre_compile.compile(pattern, flags)
302 if not (flags & DEBUG):
TypeError: first argument must be string or compiled pattern
A simpler way to demonstrate the problem is:
re.compile(regex.compile(pattern))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-64-38578ab20aeb> in <module>()
----> 1 re.compile(regex.compile(pattern))
~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
231 def compile(pattern, flags=0):
232 "Compile a regular expression pattern, returning a pattern object."
--> 233 return _compile(pattern, flags)
234
235 def purge():
~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
298 return pattern
299 if not sre_compile.isstring(pattern):
--> 300 raise TypeError("first argument must be string or compiled pattern")
301 p = sre_compile.compile(pattern, flags)
302 if not (flags & DEBUG):
TypeError: first argument must be string or compiled pattern
Problem description
The regex library does not seem to be supported by pandas. Not sure if you want to add support for it, but I had a quick look and it seems relatively straightforward to add support for it (plus it would make maintenance easier for projects that have already opted for regex).
How to fix
So, I think that the steps that seem to be required are:
- pandas.core.dtypes.inference.is_re should return True for regex compiled patterns too (assuming that regex is installed, of course).
- Make sure that you call is_re before re.compile() (as is being done e.g. here):
    if not is_re(pat):
        pat = re.compile(pat, flags)
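A minimal sketch of the two steps above, not the actual pandas implementation. The names `is_re_like` and `ensure_compiled` are hypothetical stand-ins for `is_re` and the guarded `re.compile()` call, and the sketch assumes `regex` is an optional dependency that may not be installed:

```python
import re

try:
    import regex  # third-party module; treated as optional here
except ImportError:
    regex = None

# Collect the compiled-pattern types that should be accepted as-is.
_pattern_types = [type(re.compile(""))]
if regex is not None:
    _pattern_types.append(type(regex.compile("")))
_PATTERN_TYPES = tuple(_pattern_types)


def is_re_like(obj):
    """Hypothetical variant of is_re: True for re or regex compiled patterns."""
    return isinstance(obj, _PATTERN_TYPES)


def ensure_compiled(pat, flags=0):
    """Compile pat with re only if it is not already a compiled pattern."""
    if not is_re_like(pat):
        pat = re.compile(pat, flags)
    return pat
```

With this guard, a pattern compiled by `regex` would be passed through untouched instead of reaching `re.compile()`, which is where the `TypeError` in the traceback originates.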
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.17.5-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.3
pytest: 3.7.1
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.6.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 5 years ago
- Reactions:4
- Comments:18 (8 by maintainers)
Top GitHub Comments
I totally forgot about this.
I am willing to take the lead on this, going through the effort to update the docs, run the test suite, etc.
However, I think my patch is a hack around the fact that regex objects are not instances of typing.Pattern. I can think of two solutions that are better than the one I originally proposed:
1. A typing.Protocol that covers the relevant methods and attributes used within Pandas.
2. A Pattern type that, unlike the current typing.Pattern, is not an alias to re.Pattern, but is its own class with a __subclasshook__ implementation, much like the classes in collections.abc. I think this is generally an improvement over the existing typing.Pattern that can (and should) be contributed back to the Python community as a PEP.
The reason I believe a generic solution is better than a regex-specific solution is that there are yet other regex libraries that someone might want to use (e.g. RE2).
I am willing to start work on (1), free time permitting, and possibly even (2). But I’d like some feedback on this idea from the Pandas dev community before I commit a bunch of time for it.
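For illustration, the two ideas in this comment could look roughly like the sketch below. The class names `PatternLike` and `AbstractPattern` are hypothetical, and the set of required members (`match`, `search`, `pattern`, `flags`, `groups`) is an assumption, not a list taken from the pandas source:

```python
import re
from abc import ABC
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PatternLike(Protocol):
    """Idea (1): a structural type covering the members a caller relies on."""

    pattern: Any
    flags: int
    groups: int

    def match(self, string: Any, *args: Any, **kwargs: Any) -> Any: ...

    def search(self, string: Any, *args: Any, **kwargs: Any) -> Any: ...


class AbstractPattern(ABC):
    """Idea (2): an ABC that recognises pattern-like classes structurally."""

    _required = ("match", "search", "pattern", "flags", "groups")

    @classmethod
    def __subclasshook__(cls, subclass):
        if cls is AbstractPattern:
            # Accept any class whose MRO defines every required member.
            if all(
                any(name in vars(base) for base in subclass.__mro__)
                for name in cls._required
            ):
                return True
        return NotImplemented


compiled = re.compile(r"\d")
print(isinstance(compiled, PatternLike))      # structural check via Protocol
print(isinstance(compiled, AbstractPattern))  # structural check via ABC hook
```

Both checks only look at which members exist, so a compiled pattern from another regex library would pass as long as it exposes the same members, without pandas importing that library.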
@jbrockmendel did you take a look at my proposed patch? It will probably need a major rebase obviously. Just want to make sure what I did is an acceptable approach before I put more time into it.