Entire HTML is not checked for finding base tag
In the HTML we are crawling, a base tag is set. The HTML also contains a large amount of comments and whitespace, so the base tag does not appear within the first 4096 characters.
In the code here - https://github.com/scrapy/scrapy/blob/b8870ee8a10360aaa74298324d97c823b88ec5c6/scrapy/utils/response.py#L27
    def get_base_url(response):
        """Return the base url of the given response, joined with the response url"""
        if response not in _baseurl_cache:
            text = response.text[0:4096]
            _baseurl_cache[response] = html.get_base_url(text, response.url,
                                                         response.encoding)
        return _baseurl_cache[response]
As the code above shows, the base tag is not searched for beyond the first 4096 characters. This caused our crawl to fail. I believe a limit like this should not be hard-coded, and I request that it at least be made configurable.
We are stuck on this; please advise what needs to be done.
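To make the failure concrete, here is a minimal standalone sketch of the situation described above. It uses only the standard library; the regular expression and the `find_base_url` helper are simplified stand-ins written for illustration, not the actual w3lib or Scrapy code, and the URLs are made up.

```python
import re
from urllib.parse import urljoin

# Simplified stand-in for a <base href="..."> lookup; NOT w3lib's actual code.
BASE_HREF_RE = re.compile(
    r"<base\s[^>]*href\s*=\s*[\"']\s*([^\"'\s]+)\s*[\"']", re.I
)


def find_base_url(text, page_url):
    """Return the <base href> found in text joined with page_url, else page_url."""
    match = BASE_HREF_RE.search(text)
    if match:
        return urljoin(page_url, match.group(1))
    return page_url


page_url = "https://example.com/a/b.html"
# A long comment pushes the <base> tag past the 4096-character mark.
padding = "<!--" + " " * 5000 + "-->"
html = (
    "<html><head>" + padding
    + '<base href="https://cdn.example.com/root/"></head></html>'
)

truncated = find_base_url(html[0:4096], page_url)  # only the first 4096 characters
full = find_base_url(html, page_url)               # the whole document

print(truncated)  # falls back to the page URL, because the tag was cut off
print(full)       # picks up the declared base URL
```

With the truncated text, relative links would be resolved against the page URL instead of the declared base, which matches the crawl failure reported here.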
Issue Analytics
- State:
- Created 6 years ago
- Comments:14 (12 by maintainers)
Top GitHub Comments
Hey @kmike, there are several places where the function scrapy.utils.response.get_base_url is used. For example, performing a urljoin on a Response calls this method internally: https://github.com/scrapy/scrapy/blob/2371a2a0dfbdc535bbe88ae68e986d35063653bf/scrapy/http/response/text.py#L82
But scrapy.linkextractors.regex.RegexLinkExtractor, for example, uses get_base_url from w3lib.html without cutting the text to the first 4096 characters: https://github.com/scrapy/scrapy/blob/4a93be4ad8a3f6531bf39cb0a9a7d068e521f5ae/scrapy/linkextractors/regex.py#L35
FormRequest also uses get_base_url from utils.response: https://github.com/scrapy/scrapy/blob/73668ce4076b87d2d2493f2c9b445c643da9055a/scrapy/http/request/form.py#L73
For me it’s not clear why some places use one function and some use the other. I also can’t tell why we can’t change the implementation to always use w3lib.html.
Other places where the function is used can be found here: https://github.com/scrapy/scrapy/search?utf8=✓&q=get_base_url&type=
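One way the two call paths could be reconciled is a single helper whose scan limit is a parameter, so that callers choose between a bounded scan and a full one. The sketch below is only an illustration under stated assumptions: `FakeResponse`, the `max_chars` parameter, and the regular expression are hypothetical and are not part of Scrapy or w3lib.

```python
import re
from urllib.parse import urljoin

# Simplified stand-in for a <base href> lookup; not the library's code.
_BASE_HREF_RE = re.compile(
    r"<base\s[^>]*href\s*=\s*[\"']\s*([^\"'\s]+)\s*[\"']", re.I
)


class FakeResponse:
    """Hypothetical minimal response object with only .url and .text."""

    def __init__(self, url, text):
        self.url = url
        self.text = text


def get_base_url(response, max_chars=4096):
    """Look for <base href> in the first max_chars characters of the body.

    max_chars=None scans the whole document, since text[:None] is the full text.
    """
    text = response.text[:max_chars]
    match = _BASE_HREF_RE.search(text)
    if match:
        return urljoin(response.url, match.group(1))
    return response.url


body = "<head><!--" + "x" * 6000 + '--><base href="/static/"></head>'
response = FakeResponse("https://example.com/shop/item.html", body)

default_base = get_base_url(response)               # bounded scan misses the tag
full_base = get_base_url(response, max_chars=None)  # full scan finds it
```

Keeping 4096 as the default would preserve the current behaviour, while passing `max_chars=None` would match what the link extractor does when it hands the whole text to w3lib.html.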
You can run the benchmark yourself. I forked the repo here - https://github.com/raphapassini/scrapy-bench/ - just switch to the add-other-benchmarks branch and run python xpathbench.py. You can also provide an arbitrary collection of HTML files inside a single tar.gz archive. To do so, use: python xpathbench.py --filename=alexa_pages.tar.gz