Getting wrong URL when there is a dot before the URL

For this text: extractor.find_urls("My name is claim...https://t.co/SZlazvFzYx")

URL extractor returns: ['claim...https://t.co/SZlazvFzYx']

Issue Analytics

Created 5 years ago
Comments: 8 (4 by maintainers)

Top GitHub Comments

lipoja commented, Jun 16, 2021

I see this library as a tool that returns as many domains as it finds, even when some of them are “wrong” and need post-processing.

We are trying to cover all the general cases without limiting the number of returned results. Of course, the results should be correct where possible. But I would rather return a domain that contains extra text (e.g. website+www.example.com) than return nothing at all. That way users can at least see what was found and tune their parser or do some filtering.

But I would really appreciate help in any form (discussion of ideas, PRs, …).

Thank you!

Larrax commented, Jun 24, 2019

I have run into many incorrectly extracted URLs because of this issue. What’s more, the dot is not the only problem. It’s also the at sign, colon, plus, etc. With the following input…

Visit us @www.example.com
Visit our website:www.example.com
Visit our website-www.example.com
Visit our website*www.example.com
Visit our website+www.example.com
Visit our website...www.example.com
Nonsense URL = '.example.com'

find_urls outputs this list…

@www.example.com
website:www.example.com
website-www.example.com
website*www.example.com
website+www.example.com
website...www.example.com
.example.com

And there might be more.
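
Until this is handled in the library, the post-processing lipoja mentions could look roughly like the sketch below. It is not part of URLExtract: trim_leading_junk is a hypothetical helper that works on the strings find_urls already returned. It assumes the real URL starts at the first "http://", "https://" or "www." in the candidate, and that a bare leading dot (as in '.example.com') should simply be stripped. Neither assumption holds for every input.

```python
import re

# Hypothetical helper, not part of URLExtract.
# Assumption: the real URL begins at the first scheme or "www." marker.
_URL_START = re.compile(r"(?:https?://|www\.)", re.IGNORECASE)


def trim_leading_junk(candidate):
    """Drop text glued in front of a URL candidate.

    The prefix is only cut when it contains no "/", so a nested URL in a
    path or query string (example.com/?u=http://...) is left untouched.
    """
    match = _URL_START.search(candidate)
    if match and "/" not in candidate[:match.start()]:
        return candidate[match.start():]
    # No usable marker: fall back to stripping leading punctuation only.
    return candidate.lstrip(".@:+*-")


# Strings as reported in this thread, i.e. what find_urls returned.
raw_results = [
    "claim...https://t.co/SZlazvFzYx",
    "@www.example.com",
    "website:www.example.com",
    "website+www.example.com",
    ".example.com",
]
cleaned = [trim_leading_junk(r) for r in raw_results]
```

This only helps when the glued-on text comes before a recognizable marker. A result such as website+example.com (no scheme, no "www.") is returned unchanged, so some manual filtering may still be needed.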