Getting wrong URL when there is a dot before the URL
For this text:
extractor.find_urls("My name is claim...https://t.co/SZlazvFzYx")
URL extractor returns:
['claim...https://t.co/SZlazvFzYx']
Issue Analytics
- State:
- Created 5 years ago
- Comments:8 (4 by maintainers)
Top GitHub Comments
I see this library as a tool that returns as many domains as it finds, even when they are “wrong”, so some post-processing is needed.
We are trying to cover all general issues without limiting the number of returned results. Of course the results should be correct if possible. But I would rather return a domain with other text attached (e.g. website+www.example.com) than return nothing at all. That way users can at least see what was found and tune their parser or do some filtering.
But I would really appreciate any help in any form (discussion of ideas, PRs, …).
Thank you!
I have run into many incorrectly extracted URLs because of this issue. What’s more, the dot is not the only problem. It’s also the at sign, colon, plus, etc. With the following input…
find_urls outputs this list… And there might be more.
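The post-processing the maintainer mentions could look roughly like the sketch below. It is not part of the library: trim_to_scheme is a hypothetical helper name, and it assumes the unwanted text is glued in front of an explicit http:// or https:// scheme, as in the example from the issue. Results without a scheme (e.g. website+www.example.com) are left untouched, so they would still need separate filtering.

```python
import re

# Hypothetical helper, not part of the URL extractor library.
# Assumption: the junk text sits directly in front of an explicit scheme.
_SCHEME_RE = re.compile(r"https?://", re.IGNORECASE)


def trim_to_scheme(candidates):
    """Cut each extracted string so it starts at its first http(s):// scheme."""
    cleaned = []
    for candidate in candidates:
        match = _SCHEME_RE.search(candidate)
        if match:
            # Drop whatever precedes the scheme, e.g. "claim...".
            candidate = candidate[match.start():]
        # Strings without a scheme are kept unchanged.
        cleaned.append(candidate)
    return cleaned


# The list reported in the issue, passed through the helper.
print(trim_to_scheme(["claim...https://t.co/SZlazvFzYx"]))
# → ['https://t.co/SZlazvFzYx']
```

Cutting at the first scheme rather than the last keeps URLs that carry another URL in their query string intact.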