
Entire HTML is not checked for finding base tag

See original GitHub issue

The HTML we are crawling sets a base tag. It also contains a large amount of comments and whitespace, so the base tag does not appear within the first 4096 characters.

In the code here - https://github.com/scrapy/scrapy/blob/b8870ee8a10360aaa74298324d97c823b88ec5c6/scrapy/utils/response.py#L27

def get_base_url(response):
    """Return the base url of the given response, joined with the response url"""
    if response not in _baseurl_cache:
        text = response.text[0:4096]
        _baseurl_cache[response] = html.get_base_url(text, response.url,
            response.encoding)
    return _baseurl_cache[response]

As the code above shows, the base tag is not looked for beyond the first 4096 characters, and this caused our crawl to fail. I don't think a limit like this should be hard-coded, and I request that it at least be made configurable.

We are stuck on this. Please advise what needs to be done.
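To make the failure mode concrete, here is a minimal, standard-library-only sketch. It is not Scrapy's or w3lib's implementation: `find_base_url`, its `limit` parameter, the regular expression, and the example URLs are all hypothetical, chosen only to illustrate how a base tag pushed past the cut-off is missed and how a configurable limit would change the result.

```python
import re
from urllib.parse import urljoin

# Hypothetical, simplified pattern for illustration only. It does not skip
# HTML comments or handle unquoted attribute values.
_BASE_RE = re.compile(
    r"<base\s[^>]*href\s*=\s*[\"']\s*([^\"'\s>]+)", re.IGNORECASE
)


def find_base_url(text, page_url, limit=4096):
    """Return the <base href> joined with page_url.

    Only the first `limit` characters are scanned; limit=None scans the
    whole document. Falls back to page_url when no base tag is found.
    """
    window = text if limit is None else text[:limit]
    match = _BASE_RE.search(window)
    if match is None:
        return page_url
    return urljoin(page_url, match.group(1))


# 400 comment lines of 16 characters each push the base tag past 4096.
padding = "<!-- filler -->\n" * 400
html_doc = (
    "<html><head>" + padding
    + '<base href="https://cdn.example.com/assets/"></head>'
    + "<body></body></html>"
)
page_url = "https://example.com/index.html"

truncated = find_base_url(html_doc, page_url)          # default 4096 limit
full = find_base_url(html_doc, page_url, limit=None)   # whole document

print(truncated)  # falls back to the page URL
print(full)       # picks up the base tag
```

With the default limit the lookup silently falls back to the page URL, so every relative link on the page would be resolved against the wrong base.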

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 14 (12 by maintainers)

Top GitHub Comments

2 reactions
raphapassini commented, Dec 5, 2017

Hey @kmike, there are several places where the function scrapy.utils.response.get_base_url is used. For example, calling urljoin on a Response calls this function internally.

https://github.com/scrapy/scrapy/blob/2371a2a0dfbdc535bbe88ae68e986d35063653bf/scrapy/http/response/text.py#L82

But scrapy.linkextractors.regex.RegexLinkExtractor, for example, uses get_base_url from w3lib.html directly, without cutting the text to the first 4096 characters.

https://github.com/scrapy/scrapy/blob/4a93be4ad8a3f6531bf39cb0a9a7d068e521f5ae/scrapy/linkextractors/regex.py#L35

FormRequest also uses get_base_url from scrapy.utils.response.

https://github.com/scrapy/scrapy/blob/73668ce4076b87d2d2493f2c9b445c643da9055a/scrapy/http/request/form.py#L73

It’s not clear to me why some places use one function and some use the other. I also can’t tell why we couldn’t change the implementation to always use the one from w3lib.html.

Other places where the function is used can be found here: https://github.com/scrapy/scrapy/search?utf8=✓&q=get_base_url&type=

1 reaction
raphapassini commented, Dec 8, 2017

Benchmark the extraction of the base url using xpath

HTML is already parsed = False
Total parsed files = 343
Time taken: 1.9697003246983513

Benchmark the get_base_url function when the HTML is already parsed

HTML is already parsed = True
Total parsed files = 343
Time taken: 1.7726598485605791

Benchmark the current implementation of get_base_url in scrapy.utils.response.get_base_url

Total parsed files = 343
Time taken: 0.06588769040536135

You can run the benchmark yourself. I forked the repo here - https://github.com/raphapassini/scrapy-bench/ - just switch to the add-other-benchmarks branch and run python xpathbench.py. You can also provide an arbitrary collection of HTML files inside a single tar.gz archive. To do so, use: python xpathbench.py --filename=alexa_pages.tar.gz
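The trade-off behind these numbers (scanning a short prefix versus processing the whole document) can be tried out in miniature with the standard library alone. The sketch below is not the xpathbench.py script and will not reproduce the figures above. It uses `html.parser` as a stand-in for a full parse, and `base_from_prefix`, `base_from_full_parse`, and the synthetic document are hypothetical names and data made up for this illustration.

```python
import re
import timeit
from html.parser import HTMLParser

# Hypothetical, simplified pattern; illustration only.
_BASE_RE = re.compile(
    r"<base\s[^>]*href\s*=\s*[\"']\s*([^\"'\s>]+)", re.IGNORECASE
)


class BaseHrefParser(HTMLParser):
    """Walks the whole document and records the first <base href>."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.href is None:
            self.href = dict(attrs).get("href")


def base_from_prefix(text, limit=4096):
    """Regex search restricted to the first `limit` characters."""
    match = _BASE_RE.search(text[:limit])
    return match.group(1) if match else None


def base_from_full_parse(text):
    """Feed the entire document through a parser before answering."""
    parser = BaseHrefParser()
    parser.feed(text)
    parser.close()
    return parser.href


# Synthetic page: base tag near the top, followed by a long body.
doc = (
    '<html><head><base href="https://example.com/b/"></head><body>'
    + "<p>row</p>" * 5000
    + "</body></html>"
)

prefix_time = timeit.timeit(lambda: base_from_prefix(doc), number=10)
full_time = timeit.timeit(lambda: base_from_full_parse(doc), number=10)

print(f"prefix scan: {prefix_time:.6f}s for 10 runs")
print(f"full parse:  {full_time:.6f}s for 10 runs")
```

Absolute timings depend on the machine and the documents, so the point of the sketch is only the comparison: the prefix scan does a bounded amount of work per page, while the full parse grows with document size.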


