Add support for "regex" library


Code Sample, a copy-pastable example if possible

import re
import pandas as pd
import regex

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "1", "2"]})
pattern = r"\d"

df.b.str.match(pattern)
df.b.str.match(re.compile(pattern))
df.b.str.match(regex.compile(pattern))     # raises TypeError
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-65-eec2b9ae9613> in <module>()
      9 df.b.str.match(pattern)
     10 df.b.str.match(re.compile(pattern))
---> 11 df.b.str.match(regex.compile(pattern))

~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in match(self, pat, case, flags, na, as_indexer)
   2421     def match(self, pat, case=True, flags=0, na=np.nan, as_indexer=None):
   2422         result = str_match(self._data, pat, case=case, flags=flags, na=na,
-> 2423                            as_indexer=as_indexer)
   2424         return self._wrap_result(result)
   2425 

~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in str_match(arr, pat, case, flags, na, as_indexer)
    736         flags |= re.IGNORECASE
    737 
--> 738     regex = re.compile(pat, flags=flags)
    739 
    740     if (as_indexer is False) and (regex.groups > 0):

~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
    298         return pattern
    299     if not sre_compile.isstring(pattern):
--> 300         raise TypeError("first argument must be string or compiled pattern")
    301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):

TypeError: first argument must be string or compiled pattern

A simpler way to demonstrate the problem is:

re.compile(regex.compile(pattern))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-64-38578ab20aeb> in <module>()
----> 1 re.compile(regex.compile(pattern))

~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
    298         return pattern
    299     if not sre_compile.isstring(pattern):
--> 300         raise TypeError("first argument must be string or compiled pattern")
    301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):

TypeError: first argument must be string or compiled pattern

Problem description

The regex library does not seem to be supported by pandas. I'm not sure whether you want to add support for it, but I had a quick look and it seems relatively straightforward to do (and it would make maintenance easier for projects that have already opted for regex).

How to fix

So, I think the required steps are:

  1. pandas.core.dtypes.inference.is_re should return True for regex compiled patterns too (assuming that regex is installed, of course).
  2. Make sure to call is_re before re.compile() (as is already done e.g. here):
if not is_re(pat):
    pat = re.compile(pat, flags)
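
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual pandas implementation: the is_re below is a standalone stand-in for pandas.core.dtypes.inference.is_re, and compile_if_needed is a hypothetical helper name. The regex import is optional so the sketch still runs when regex is not installed.

```python
import re

try:
    import regex  # third-party library; treated as optional here
except ImportError:
    regex = None

# Collect the compiled-pattern types we are willing to accept.
# type(re.compile("")) works on every Python 3 version.
_pattern_types = [type(re.compile(""))]
if regex is not None:
    _pattern_types.append(type(regex.compile("")))
_PATTERN_TYPES = tuple(_pattern_types)


def is_re(obj) -> bool:
    """Stand-in for pandas' is_re: True for re (and regex) compiled patterns."""
    return isinstance(obj, _PATTERN_TYPES)


def compile_if_needed(pat, flags=0):
    """Hypothetical helper: compile only when pat is not already compiled."""
    if not is_re(pat):
        pat = re.compile(pat, flags)
    return pat
```

With a guard like this, an already compiled pattern is passed through unchanged and never reaches re.compile(), which is where the TypeError above comes from.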

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.17.5-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.3
pytest: 3.7.1
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.6.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 4
  • Comments: 18 (8 by maintainers)

Top GitHub Comments

2 reactions
gwerbin commented, Mar 18, 2021

I totally forgot about this.

I am willing to take the lead on this, going through the effort to update the docs, run the test suite, etc.

However, I think my patch is a hack around the fact that regex objects are not instances of typing.Pattern. I can think of two solutions that are better than the one I originally proposed:

  1. Use a runtime-checkable typing.Protocol that covers the relevant methods and attributes used within Pandas.
  2. Implement a Pattern type that, unlike the current typing.Pattern, is not an alias to re.Pattern, but is its own class with a __subclasshook__ implementation, much like the classes in collections.abc. I think this is generally an improvement over the existing typing.Pattern that can (and should) be contributed back to the Python community as a PEP.
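
A minimal sketch of option (1), assuming Python 3.8+ for typing.Protocol. The name PatternLike and the list of members are assumptions made for illustration; the members pandas would actually need to cover would have to be checked against its source.

```python
import re
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PatternLike(Protocol):
    """Hypothetical structural type for 'anything that behaves like a compiled pattern'."""

    # Data members; isinstance() only checks that they are present.
    pattern: Any
    flags: int
    groups: int

    def match(self, string, *args, **kwargs): ...
    def search(self, string, *args, **kwargs): ...
    def fullmatch(self, string, *args, **kwargs): ...
    def findall(self, string, *args, **kwargs): ...
    def sub(self, repl, string, *args, **kwargs): ...


def is_pattern_like(obj) -> bool:
    # Structural check: no import of regex, RE2, etc. is required.
    return isinstance(obj, PatternLike)
```

Note that a runtime-checkable protocol only verifies that the members exist, not their signatures or behaviour, so this is a duck-typing check rather than a guarantee of compatibility.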

The reason I believe a generic solution is better than a regex-specific solution is that there are yet other regex libraries that someone might want to use (e.g. RE2).
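
For comparison, option (2) could look something like the sketch below, modelled on how the classes in collections.abc use __subclasshook__. The class name PatternABC and the required member names are assumptions; this is not an existing pandas or standard-library class.

```python
import re
from abc import ABC


class PatternABC(ABC):
    """Hypothetical ABC that recognises any class exposing a pattern-like interface."""

    _REQUIRED = ("match", "search", "fullmatch", "sub", "pattern", "flags")

    @classmethod
    def __subclasshook__(cls, C):
        if cls is PatternABC:
            # Accept C if every required name is defined somewhere in its MRO.
            if all(any(name in B.__dict__ for B in C.__mro__)
                   for name in cls._REQUIRED):
                return True
        return NotImplemented


class FakeEnginePattern:
    """Stand-in for a compiled pattern from some other regex engine."""
    pattern = r"\d"
    flags = 0

    def match(self, string): ...
    def search(self, string): ...
    def fullmatch(self, string): ...
    def sub(self, repl, string): ...
```

Because the check is purely structural, isinstance(obj, PatternABC) would accept compiled patterns from re and from any other library with the same members, without pandas having to import that library.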

I am willing to start work on (1), free time permitting, and possibly even (2). But I’d like some feedback on this idea from the Pandas dev community before I commit a bunch of time for it.

1 reaction
gwerbin commented, Apr 17, 2020

@jbrockmendel did you take a look at my proposed patch? It will probably need a major rebase obviously. Just want to make sure what I did is an acceptable approach before I put more time into it.
