
FormRequest.from_response takes forever.

See original GitHub issue

Description

I have created a simple spider which crawls a website and performs both GET and POST requests, but after a few POST requests it gets stuck forever. I'm running Scrapy from a script by importing CrawlerProcess.

Crawler code (excerpt; it relies on my own helper functions such as search_regex, extract_forms, and extract_params, plus a custom logger):

    def parse_form(self, response):
        """Collect details about the response to a submitted form."""
        url = response.url
        status = response.status
        request = response.request
        post_data = request.body.decode("utf-8")
        redirect_urls = request.meta.get("redirect_urls", [])
        req_url = ""
        if redirect_urls:
            req_url = redirect_urls[0]
            logger.info(f"Crawled: {req_url}")
        method = response.request.method
        body = response.text
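        # search_regex is a custom helper (not shown); "(?:PATTERN)" below is a
        # placeholder standing in for the real regex.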
        mobj = search_regex(
            pattern=(
                r"(?is)(?:PATTERN)",
            ),
            string=body,
            group="ref",
            default={},
        )
        if mobj:
            yield {
                "response_url": url,
                "status": status,
                "method": method,
                "request_url": req_url,
                "post_data": post_data,
                "redirect_urls": redirect_urls,
            }

    def parse(self, response):
        """Parse responses."""
        url = response.url
        status = response.status
        parsed = urlparse.urlparse(url)
        forms = extract_forms(response, scrapy)
        path = f"{parsed.scheme}://{parsed.netloc}{parsed.path if parsed.path else ''}"
        body = response.text
        cookies = headers_to_dict(response.request, response)
        for entry in forms:
            _id = entry.get("id")
            logger.debug(f"sending form with formid: '{_id}'..")
            payloads = entry.get("payloads", [])
            for payload in payloads:
                logger.payload(f"    * parameters: '{payload}'")
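                # Build a request from the page's <form> matching this id,
                # pre-filled with its values and overridden by the payload.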
                yield scrapy.FormRequest.from_response(
                    response,
                    meta=self.meta,
                    headers=self.headers,
                    cookies=self.cookies,
                    formid=_id,
                    formdata=payload,
                    callback=self.parse_form,
                    errback=self.parse_error,
                )
        fields = response.xpath(
            "//input[re:test(@type, '(?:text|hidden|password|checkbox|search)', 'i')]"
        )
        _params = input_params_extractor(fields)
        max_url = url if len(url) <= 105 else f"'{''.join(url[0:105])}....'"
        logger.info(f"Crawled: {max_url}")
        parameters = extract_params(body, url)
        if _params:
            parameters.extend(_params)
        if parameters:
            logger.info(f"    * found '{len(parameters)}' parameters.")
        if not self._limited:
            links = self._link_extractor.extract_links(response)
            for link in links:
                logger.debug(f"    * Crawling: {link.url}")
                if self._use_splash:
                    yield SplashRequest(
                        url=link.url,
                        meta=self.meta,
                        headers=self.headers,
                        cookies=self.cookies,
                        callback=self.parse,
                        errback=self.parse_error,
                        cache_args=["lua_source"],
                        args={
                            "wait": self.wait,
                            "lua_source": self._lua_script,
                            "timeout": 90,
                            "images": 0,
                            "resource_timeout": 10,
                        },
                        endpoint=self._splash_endpoint,
                        splash_url=self._splash_url,
                        slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,
                    )
                else:
                    yield scrapy.Request(
                        url=link.url,
                        meta=self.meta,
                        headers=self.headers,
                        cookies=self.cookies,
                        callback=self.parse,
                        errback=self.parse_error,
                    )
        else:
            logger.debug(
                f"crawler is done with processing '{url}'.. going to stop as limited crawl switch is specified.."
            )

        if parameters:
            yield {
                "url": url,
                "path": path,
                "status": status,
                "parameters": parameters,
                "forms": forms,
                "cookies": cookies,
            }
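
The methods above are only the parse callbacks. For reference, a spider of this shape can be driven from a plain script via CrawlerProcess roughly as in the minimal sketch below (hypothetical spider, using the URL from this report; not the code actually used here):

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class FormSpider(scrapy.Spider):
        """Minimal stand-in spider: submit every <form> that has an id."""

        name = "form_spider"
        start_urls = ["https://www.lider.cl/supermercado/"]

        def parse(self, response):
            for form_id in response.xpath("//form/@id").getall():
                # from_response pre-fills the request from the matching form.
                yield scrapy.FormRequest.from_response(
                    response,
                    formid=form_id,
                    callback=self.parse_form,
                )

        def parse_form(self, response):
            self.logger.info(
                "%s %s returned %s",
                response.request.method, response.url, response.status,
            )


    if __name__ == "__main__":
        # Run from a script instead of the `scrapy crawl` command.
        process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
        process.crawl(FormSpider)
        process.start()  # blocks until the crawl finishes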

Steps to Reproduce

  1. Use this URL: https://www.lider.cl/supermercado/ (the home page has a lot of forms).
  2. Start the crawler; it crawls a few pages and then gets stuck on one POST request forever.

Expected behavior: POST requests should be performed normally.

Actual behavior: The spider halts forever and I have to kill the process.

Reproduces how often: Every time I run it against the URL mentioned above.

Versions

(hsenv) λ scrapy version --verbose
Scrapy       : 2.4.1
lxml         : 4.6.2.0
libxml2      : 2.9.5
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:23:07) [MSC v.1927 32 bit (Intel)]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020)
cryptography : 3.3.1
Platform     : Windows-8.1-6.3.9600-SP0

It would be great if you could assist me, in case I'm doing something wrong.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
wRAR commented, Feb 16, 2021

Can you provide a minimal reproducible code sample that we can check?

0 reactions
r0oth3x49 commented, Feb 17, 2021

As Andrey points out, the posted example is definitely not minimal - it’s over 450 lines long. Please take a look at this guide. In this case, you should strip your example of everything that is not necessary to reproduce the issue. In addition, during this simplification process you might even find the cause of the problem by yourself.

I found the cause of the problem and fixed it, thank you. The issue was on my side, not Scrapy: the regex was causing it because of a wildcard. I fixed that and now everything is working fine.

Thank you so much for the suggestion.
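
For context, a wildcard wrapped in nested quantifiers is a classic recipe for catastrophic backtracking: when the pattern cannot match, Python's re engine has to try every possible way of splitting the input before it gives up, and on a response body of tens of kilobytes that effectively never finishes, so the callback looks like it hangs on one request forever. A minimal sketch of the effect (hypothetical pattern, not the actual regex from this report):

    import re
    import time

    # Hypothetical pattern: a wildcard inside nested quantifiers. If "</form>"
    # never appears, the engine must try every way of splitting the text
    # between the two '+' loops before it can fail, so the work roughly
    # doubles with every extra character.
    evil = re.compile(r"(?s)(.+)+</form>")

    for extra in (10, 12, 14, 16):
        body = "<form>" + "x" * extra          # fragment with no closing tag
        start = time.perf_counter()
        assert evil.match(body) is None        # always fails, but slowly
        print(f"{len(body):>2} chars: {time.perf_counter() - start:.4f}s")

    # A rewrite without the nested repetition fails in linear time.
    safe = re.compile(r"(?s).+?</form>")
    start = time.perf_counter()
    assert safe.match("<form>" + "x" * 16) is None
    print(f"safe pattern: {time.perf_counter() - start:.6f}s")

Removing the nested repetition (or otherwise anchoring the wildcard) keeps the scan linear, which matches the fix described above.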

Read more comments on GitHub >

