Got garbled code on some website

I can fetch data correctly with scrapy-playwright on most websites, using code like this:

def start_requests(self):
    # my code
    # ......

def parse(self, response):
    page = response.meta["playwright_page"]
    response_text = await page.content()
    with open(self.config.task_name+'.html', 'w', encoding='utf-8') as f:
        f.write(response_text)
    # my code
    # ......

But I fail to fetch the right data on one other website. (This is actually a rare problem.)

When I use the code above to save page.content(), the saved file is completely unreadable. I don’t know how to solve this problem.

Here is my server info:

System version: CentOS 7.6
Scrapy: 2.5.1
scrapy-playwright: 0.0.5
playwright: 1.16.0
...

Looking forward to some solutions 😃

Issue Analytics

Created a year ago
Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction

Alienboyplus commented, May 13, 2022

I have just solved this problem. Here is what I did:

I reinstalled CentOS 7.6 on my server and then used my .sh file to set up the environment for my project.
I checked my .sh file again and found some unnecessary Python packages, such as selenium_wire, so this time I did not pip install them.
Last year, when I first tried to run my project with scrapy-playwright, an exception said I needed to install some dependencies such as at-spi2-atk, libxkbcommon-x11-devel and glibc-2.18, so I added them to my .sh file. This time I did not install those dependencies, and it works correctly (scrapy-playwright==0.0.5 still works)!

So I believe it was just a bug related to the server’s environment. I can get the right data from my target website now. Thanks!
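
Since the fix described above came down to differences in the server environment, it may help to record the installed Python package versions on each setup and compare them. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the default package list are illustrative assumptions, not part of scrapy-playwright.

```python
from importlib import metadata

# Package names taken from the server info in the question.
# This tuple is only an illustrative default; adjust it as needed.
PACKAGES = ("Scrapy", "scrapy-playwright", "playwright")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this on both the working and the failing server and diffing the output gives a quick first comparison. It only covers Python packages; system libraries such as at-spi2-atk or glibc would need to be checked with the operating system's own package tools.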

0 reactions

elacuesta commented, May 11, 2022

I assumed the absence of the async keyword in your first example was just a typo. If it wasn’t and you actually did not have it before, I don’t know how the code was running in the first place: using await inside a regular function raises SyntaxError: 'await' outside async function.
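
The point about await can be checked without Scrapy or Playwright installed. The sketch below compiles two versions of a callback shaped like the one in the question, without executing them; the helper name `compiles` is a hypothetical name used only for this illustration.

```python
# Two versions of the callback as source text. Nothing is executed,
# so `response` and the Playwright page never need to exist.
SYNC_SOURCE = '''
def parse(self, response):
    page = response.meta["playwright_page"]
    return await page.content()
'''

ASYNC_SOURCE = '''
async def parse(self, response):
    page = response.meta["playwright_page"]
    return await page.content()
'''


def compiles(source: str) -> bool:
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    print("plain def compiles:", compiles(SYNC_SOURCE))
    print("async def compiles:", compiles(ASYNC_SOURCE))
```

The plain def version is rejected at compile time, before any request is made, which is why a spider written that way could not have run at all.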
You’re using scrapy-playwright==0.0.5. I recommend updating to a more recent version (the latest is 0.0.15), as there have been some fixes related to the encoding of response bodies.
If the problem persists, please report whether or not you get the same results with the standalone playwright script I posted earlier. If you do, you’re having a problem with upstream Playwright and should report it there.
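
As background on the encoding remark in this comment, here is a small standard-library sketch of how text becomes unreadable when bytes are decoded with the wrong charset, and how the damage can sometimes be reversed. It is a generic illustration, not a reproduction of this issue, and the thread does not establish that a charset mismatch was the cause here. The function names `mis_decode` and `repair` are assumptions made for the example.

```python
def mis_decode(text: str, actual: str = "utf-8", assumed: str = "latin-1") -> str:
    """Encode text with its real charset, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed)


def repair(garbled: str, actual: str = "utf-8", assumed: str = "latin-1") -> str:
    """Undo mis_decode by reversing the wrong decode step.

    This only works when the wrong codec maps every byte to a character
    without loss, as latin-1 does.
    """
    return garbled.encode(assumed).decode(actual)


if __name__ == "__main__":
    original = "café"
    garbled = mis_decode(original)
    print(repr(garbled))          # the accented letter becomes two characters
    print(repr(repair(garbled)))  # round-trips back to the original text
```

If a saved page looks garbled in a similar way, comparing the charset declared by the page with the encoding used when writing the file is a reasonable first check.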