Entire HTML is not checked for finding base tag

The HTML we are crawling sets a base tag. It also contains a large amount of comments and whitespace, so the base tag does not appear within the first 4096 characters.

The relevant code is here: https://github.com/scrapy/scrapy/blob/b8870ee8a10360aaa74298324d97c823b88ec5c6/scrapy/utils/response.py#L27

def get_base_url(response):
    """Return the base url of the given response, joined with the response url"""
    if response not in _baseurl_cache:
        text = response.text[0:4096]
        _baseurl_cache[response] = html.get_base_url(text, response.url,
            response.encoding)
    return _baseurl_cache[response]

As the code above shows, the base tag is NOT looked for beyond the first 4096 characters. This caused our crawl to fail. I believe a limit like this should not be hard-coded, and I request that it at least be made configurable.

We are stuck on this. Please advise what needs to be done.
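To make the failure concrete, here is a minimal, self-contained sketch of the problem. It does not call Scrapy or w3lib: `find_base_url` is a hypothetical, simplified regex stand-in for `w3lib.html.get_base_url`, and the URLs and padding size are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Simplified stand-in (an assumption, not w3lib's actual regex):
# capture the href of the first <base> tag found in the text.
_BASE_RE = re.compile(r'''<base\s[^>]*href\s*=\s*["']\s*([^"'\s]+)\s*["']''', re.I)


def find_base_url(text, page_url):
    """Return the <base href> joined with page_url, or page_url if none is found."""
    match = _BASE_RE.search(text)
    if match:
        return urljoin(page_url, match.group(1))
    return page_url


page_url = "http://example.com/a/page.html"

# A long comment pushes the <base> tag past the first 4096 characters.
padding = "<!-- " + "x" * 5000 + " -->"
html_doc = (
    "<html><head>" + padding
    + '<base href="http://cdn.example.com/root/"></head><body></body></html>'
)

# Mirrors `response.text[0:4096]`: the base tag is cut off and never seen.
truncated = find_base_url(html_doc[0:4096], page_url)

# Scanning the whole document finds it.
full = find_base_url(html_doc, page_url)

print(truncated)
print(full)
```

With the truncated text, relative links would be resolved against the page URL instead of the declared base, which is the kind of mismatch that can break a crawl.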

Issue Analytics

State:
Created 6 years ago
Comments:14 (12 by maintainers)

Top GitHub Comments

2 reactions

raphapassini commented, Dec 5, 2017

Hey @kmike, scrapy.utils.response.get_base_url is used in several places. For example, calling urljoin on a Response calls this function internally.

https://github.com/scrapy/scrapy/blob/2371a2a0dfbdc535bbe88ae68e986d35063653bf/scrapy/http/response/text.py#L82

By contrast, scrapy.linkextractors.regex.RegexLinkExtractor uses get_base_url from w3lib.html directly, without truncating the text to the first 4096 characters.

https://github.com/scrapy/scrapy/blob/4a93be4ad8a3f6531bf39cb0a9a7d068e521f5ae/scrapy/linkextractors/regex.py#L35

FormRequest also uses get_base_url from scrapy.utils.response.

https://github.com/scrapy/scrapy/blob/73668ce4076b87d2d2493f2c9b445c643da9055a/scrapy/http/request/form.py#L73

It’s not clear to me why some places use one function and some use the other. I also can’t see why we couldn’t change the implementation to always use the w3lib.html version.

Other places where the function is used can be found here: https://github.com/scrapy/scrapy/search?utf8=✓&q=get_base_url&type=
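One possible shape for a configurable limit is sketched below. This is not Scrapy's implementation: the `max_chars` parameter, the `FakeResponse` class, and the regex are all hypothetical, introduced only to show how a per-response cache could coexist with an adjustable (or disabled) cutoff.

```python
import re
import weakref
from urllib.parse import urljoin

# Simplified stand-in for w3lib's base-tag search (an assumption).
_BASE_RE = re.compile(r'''<base\s[^>]*href\s*=\s*["']\s*([^"'\s]+)\s*["']''', re.I)

# Cache keyed by response object; entries vanish when the response is collected.
_baseurl_cache = weakref.WeakKeyDictionary()


class FakeResponse:
    """Hypothetical minimal response with only the attributes used here."""

    def __init__(self, url, text):
        self.url = url
        self.text = text


def get_base_url(response, max_chars=4096):
    """Return the base URL of the response.

    max_chars is a hypothetical knob: an int limits the scan to that many
    characters, None scans the whole document.
    """
    per_response = _baseurl_cache.setdefault(response, {})
    if max_chars not in per_response:
        text = response.text if max_chars is None else response.text[:max_chars]
        match = _BASE_RE.search(text)
        href = match.group(1) if match else ""
        # urljoin with an empty href returns the response URL unchanged.
        per_response[max_chars] = urljoin(response.url, href)
    return per_response[max_chars]


resp = FakeResponse(
    "http://example.com/a/",
    "<!--" + " " * 6000 + "-->" + '<base href="/static/">',
)
default_result = get_base_url(resp)
unlimited_result = get_base_url(resp, max_chars=None)
```

Caching per limit keeps results from different limits from overwriting each other. In a real project the limit would more likely come from a setting than from a function argument.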

1 reaction

raphapassini commented, Dec 8, 2017

Benchmark the extraction of the base url using xpath 

HTML is already parsed = False
Total parsed files = 343
Time taken: 1.9697003246983513

Benchmark the get_base_url function when the HTML is already parsed

HTML is already parsed = True
Total parsed files = 343
Time taken: 1.7726598485605791

Benchmark the current implementation of get_base_url in scrapy.utils.response.get_base_url

Total parsed files = 343
Time taken: 0.06588769040536135

You can run the benchmark yourself. I forked the repo here: https://github.com/raphapassini/scrapy-bench/. Switch to the add-other-benchmarks branch and run python xpathbench.py. You can also provide an arbitrary collection of HTML files inside a single tar.gz archive. To do so, use: python xpathbench.py --filename=alexa_pages.tar.gz
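For a rough local check that needs neither the benchmark repo nor Scrapy, the sketch below times a regex scan of the first 4096 characters against a scan of the full text. It is a different measurement from the xpath benchmark above: the regex is a simplified stand-in for w3lib's, and the synthetic page is made up, so the numbers only give a sense of relative cost on your machine.

```python
import re
import timeit

# Simplified stand-in for w3lib's base-tag search (an assumption).
_BASE_RE = re.compile(r'''<base\s[^>]*href\s*=\s*["']\s*([^"'\s]+)\s*["']''', re.I)


def scan(text, max_chars=None):
    """Search for a <base> tag in the first max_chars characters (all if None)."""
    chunk = text if max_chars is None else text[:max_chars]
    return _BASE_RE.search(chunk)


# Synthetic worst case: a large page with no <base> tag at all,
# so the full scan has to read every character.
big_page = "<html><head>" + "<!-- filler -->" * 20000 + "</head></html>"

truncated_time = timeit.timeit(lambda: scan(big_page, 4096), number=50)
full_time = timeit.timeit(lambda: scan(big_page), number=50)

print(f"first 4096 chars: {truncated_time:.6f}s for 50 runs")
print(f"full document:    {full_time:.6f}s for 50 runs")
```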