BUG: colon in URL gets cut off during call of to_html() when applying format(hyperlinks='html')
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame([['www.google.com:80']])
styler = df.style.format(hyperlinks='html')
print(styler.to_html())
Issue Description
The user wants to use format(hyperlinks='html') to automatically convert hyperlink-like text (starting with http:, https: or www.) into hyperlinks. However, when the URL contains a colon after that prefix, everything from the colon onward gets cut off during the conversion, in both the hyperlink href and the link text.
I've traced the error to the _render_href method, more specifically to the regex it uses. Included is a demonstration of the truncation using that regex.
Also included is the demonstration of a fix, adding a colon:
The same issue occurs with other characters such as # or + (I didn't check for others).
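The truncation can be reproduced with the standard library alone. The sketch below is illustrative only: NARROW and WITH_COLON are hypothetical patterns written to match the description above (a prefix followed by character classes that do or do not include a colon), not a verbatim copy of the regex inside pandas' _render_href, and render_href is a stand-in helper defined here.

```python
import re

# Hypothetical pattern for illustration: a scheme or "www." prefix followed by
# two character-class runs joined by a dot. Neither class contains ":".
NARROW = r"(https?://|ftp://|www\.)[\w/\-?=%.]+\.[\w/\-&?=%.]+"

# Same pattern with ":" added to the trailing character class.
WITH_COLON = r"(https?://|ftp://|www\.)[\w/\-?=%.]+\.[\w/\-&?=%.:]+"

HREF = '<a href="{0}" target="_blank">{0}</a>'


def render_href(text: str, pattern: str) -> str:
    """Wrap every substring matching `pattern` in an HTML anchor tag."""
    return re.sub(pattern, lambda m: HREF.format(m.group(0)), text)


narrow_out = render_href("www.google.com:80", NARROW)
colon_out = render_href("www.google.com:80", WITH_COLON)

print(narrow_out)  # the match stops before ":", so ":80" falls outside the link
print(colon_out)   # the whole string becomes the link
```

With the narrow pattern the match ends at www.google.com, which is why the port is missing from both the href and the link text.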
Expected Behavior
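Input string is: 'www.google.com:80'
Expected output is: www.google.com:80, with the whole string (including :80) as the hyperlink
Current output is: www.google.com:80, with only www.google.com as the hyperlink and :80 left as plain text after it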
Installed Versions
INSTALLED VERSIONS
commit : 06d230151e6f18fdb8139d09abf539867a8cd481
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Croatian_Croatia.1252
pandas : 1.4.1
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.1
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.25
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
I spent some time tinkering with the regex, and it seems that a perfect regex covering every corner case practically doesn't exist. However, I think adding the aforementioned characters to the last capturing group of the regex significantly expands its coverage, so I decided to open a PR. I'd be glad if you took a look at it. By the way, I'm a total newbie, so I apologize if I've handled it too naively.
That would be my solution as well (you beat me to it), with the addition that I would probably add the same characters to the first group as well, perhaps too naively.
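To make the difference between the two suggestions concrete, here is a rough sketch comparing a pattern that adds the extra characters only to the trailing part with one that adds them to both parts. PREFIX, EXTRA, LAST_ONLY, BOTH and matched_urls are hypothetical names made up for this illustration, and the patterns are approximations rather than the regex actually used in pandas or in the PR.

```python
import re

PREFIX = r"(https?://|ftp://|www\.)"
EXTRA = ":#+"  # the characters named in the report

# Hypothetical variant 1: extra characters only in the trailing character class.
LAST_ONLY = PREFIX + r"[\w/\-?=%.]+\.[\w/\-&?=%." + EXTRA + r"]+"

# Hypothetical variant 2: extra characters in both character classes.
BOTH = PREFIX + r"[\w/\-?=%." + EXTRA + r"]+\.[\w/\-&?=%." + EXTRA + r"]+"


def matched_urls(text: str, pattern: str) -> list:
    """Return every substring of `text` that `pattern` treats as a URL."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Both variants handle the string from the report.
report_last_only = matched_urls("www.google.com:80", LAST_ONLY)
report_both = matched_urls("www.google.com:80", BOTH)

# A colon that comes before the first dot is only covered by the second variant.
local_url = "http://localhost:8080/index.html#top"
local_last_only = matched_urls(local_url, LAST_ONLY)
local_both = matched_urls(local_url, BOTH)

print(report_last_only, report_both)
print(local_last_only, local_both)
```

Under these assumptions, extending only the trailing part is enough for www.google.com:80, while a URL such as http://localhost:8080/index.html#top is matched only when the leading part accepts the colon as well.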