Issue scraping Amazon
Hello again!
unfurl.js returned
{
"title": "Sorry! Something went wrong!",
"favicon": "https://www.amazon.com/favicon.ico"
}
on one occasion when scraping https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1
Now it works better and returns correct metadata.
My question is: did all of that come from Amazon? I don’t think that title is something that is coming from unfurl or any associated libraries… so if that’s the case, nothing can be done except to retry, right?
The problem is that I cannot know which requests to retry (automatically)… is there a good way to detect such cases, perhaps based on the HTTP status code coming from the server?
I’d like to check for that and not save any metadata like this example above to the database.
thank you
UPDATE:
Amazon returns a 503 with HTML that contains the above title “Sorry! …”
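Since the failure shows up as an HTTP 503 (the “Sorry!” page), one way to avoid saving bad metadata is to check the status code before trusting the result. A minimal sketch of the idea, assuming Node 18+ (global fetch) and the unfurl.js API; fetchMetadataSafely is just an illustrative helper name, not part of any library:

const { unfurl } = require('unfurl.js');

// Probe the URL first and only run unfurl (and save the result) when the
// server answers with a 2xx status. This does mean two requests per URL.
async function fetchMetadataSafely(url) {
  const probe = await fetch(url, { redirect: 'follow' });
  if (!probe.ok) {
    // Amazon's 503 "Sorry! Something went wrong!" page ends up here
    return { ok: false, status: probe.status };
  }
  const metadata = await unfurl(url);
  return { ok: true, status: probe.status, metadata };
}

The probe and the unfurl request can still be treated differently by Amazon, so keeping a check on the returned title (e.g. rejecting “Sorry! Something went wrong!”) as a second line of defence before saving wouldn’t hurt.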
Here it is: https://gist.github.com/davidhq/dc097bf6eeeaefee47443cdf5dde9cfa
but it will probably behave differently from your IP (?)
I saved the .html retrieved: https://uniqpath.com/temp/result_unfurl_amazon_test.html
and it doesn’t seem to contain open_graph metadata, so nothing to be done probably…
You can try running the script with the other user agent (mimicking the Google spider) and you will either get a non-200 (error) or, at first, the full response.
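For reference, the user-agent experiment can be reproduced with a plain fetch (Node 18+), checking whether any Open Graph tags come back at all. The two UA strings are the ones from the update below; the results will of course vary by IP:

// Fetch the page with a given User-Agent and report whether the HTML
// contains any <meta property="og:..."> tags.
async function hasOpenGraph(url, userAgent) {
  const res = await fetch(url, { headers: { 'User-Agent': userAgent } });
  const html = res.ok ? await res.text() : '';
  return { status: res.status, hasOg: /<meta[^>]+property=["']og:/i.test(html) };
}

const googlebotUA = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
const chromeUA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36';

// hasOpenGraph('https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1', googlebotUA).then(console.log);
// hasOpenGraph('https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1', chromeUA).then(console.log);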
Quick update: changing the user agent from
'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
to
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
makes Amazon return 200 again from my local IP (other services as well), BUT Amazon does not return the complete data even with 200: open_graph and twitter_card are missing.
In any case, with some permutation of user agents, slower scraping, and sharing the work across a network, there might be a chance that scraping thousands of links per node (over time) can mostly succeed.
So for this project it seems that using a 3rd-party paid service like brightdata, or working around it smartly, are the two possible options… still not sure how to more or less reliably fetch the entire metadata for Amazon; this might be a problem… possibly for others as well… too bad they decide not to return it once they figure out that the request is legit. I wonder if they added the Twitter and Facebook ASNs to an allowlist so that social previews can be generated for sharing to these networks?