Issue scraping Amazon

Hello again!

unfurl.js returned

{
      "title": "Sorry! Something went wrong!",
      "favicon": "https://www.amazon.com/favicon.ico"
}

on one occasion when scraping https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1

Now it works better and returns the correct metadata.

My question is: did all of that come from Amazon? I don’t think that title is something that is coming from unfurl or any associated libraries… so if that’s the case, nothing can be done except to retry, right?

The problem is that I cannot know which data to retry automatically. Is there a good way to detect such cases based on something else, perhaps the HTTP status code returned by the server?

I’d like to check for that and not save any metadata like this example above to the database.

thank you

UPDATE:

Amazon returns a 503 with HTML that contains the title above (“Sorry! …”).
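Given that the error page arrives with a 503, one way to keep such results out of the database is to check the HTTP status before saving anything. The following is only a minimal sketch under stated assumptions: the helper names (`shouldStoreMetadata`, `isRetryableStatus`, `probeStatus`) are hypothetical and not part of unfurl.js, the separate probe request assumes Node.js 18+ with a global `fetch`, and treating 429 as retryable alongside 503 is an assumption rather than something observed in this report.

```javascript
// Hypothetical helpers; only the 503 status and the error title come from the report above.
const KNOWN_ERROR_TITLES = ['Sorry! Something went wrong!'];

// Decide whether scraped metadata is safe to persist.
function shouldStoreMetadata(status, metadata) {
  // Any non-2xx response (such as Amazon's 503) is treated as a failed scrape.
  if (status < 200 || status >= 300) return false;
  // Guard against error pages that slip through with a 2xx status.
  const title = ((metadata && metadata.title) || '').trim();
  return title !== '' && !KNOWN_ERROR_TITLES.includes(title);
}

// Statuses that usually indicate a temporary condition worth retrying later.
function isRetryableStatus(status) {
  return status === 429 || status === 503;
}

// Probe the URL so the status code is known before metadata is stored.
// Assumes a runtime with a global fetch (Node.js 18+).
async function probeStatus(url, userAgent) {
  const response = await fetch(url, { headers: { 'User-Agent': userAgent } });
  return response.status;
}
```

With this in place, a result is saved only when `shouldStoreMetadata(status, metadata)` is true, and the URL is queued for a later retry when `isRetryableStatus(status)` is true. The extra probe request doubles the traffic to the site, so it may be preferable to obtain the status from the same request that fetches the HTML if the scraping library exposes it.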

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 19 (6 by maintainers)

Top GitHub Comments

davidhq commented, May 31, 2021 (1 reaction)

Here it is: https://gist.github.com/davidhq/dc097bf6eeeaefee47443cdf5dde9cfa

but it will probably behave differently from your IP (?)

I saved the .html retrieved: https://uniqpath.com/temp/result_unfurl_amazon_test.html

and it doesn’t seem to contain open_graph metadata, so nothing to be done probably…

you can try running the script with the other user agent (mimicking google spider)

const userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

You should either get a non-200 (error) response or, at first, the full response.

davidhq commented, May 31, 2021 (1 reaction)

Quick update:

  • After changing userAgent from 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' to 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36', Amazon again returns 200 from my local IP, and other services do as well.

BUT

Amazon does not return the complete data even with a 200: open_graph and twitter_card are missing.
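Because a 200 alone does not guarantee complete data, a second check on the result itself can help decide whether it is worth storing or retrying. This is a rough sketch: the field names `open_graph` and `twitter_card` are the ones mentioned above, while the helper name `missingMetadataSections` and the choice of which sections count as required are assumptions for illustration.

```javascript
// Hypothetical helper: lists which expected sections are absent or empty.
function missingMetadataSections(metadata) {
  // Assumption: these two sections are the ones the application needs.
  const required = ['open_graph', 'twitter_card'];
  return required.filter((key) => {
    const section = metadata ? metadata[key] : undefined;
    // A section counts as missing when it is absent or has no keys.
    return section == null || Object.keys(section).length === 0;
  });
}
```

An empty array means both sections are present; anything else can be logged or used to schedule another attempt.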

In any case, with some combination of rotating user agents, scraping more slowly, and sharing the work across a network of nodes, there might be a chance that scraping can mostly succeed for thousands of links per node (over time).
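The idea of rotating user agents and slowing down could look roughly like the sketch below. Everything here is illustrative: the function names, the delay values, and the `fetchOnce` callback (a caller-supplied async function resolving to `{ ok, data }`) are assumptions, and the user agent pool simply reuses the two strings quoted in this thread. There is no evidence in this thread that this approach actually gets complete data from Amazon.

```javascript
// Illustrative pool: both strings are the ones quoted in this thread.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
];

// Round-robin choice of user agent for a given attempt number (0-based).
function userAgentForAttempt(attempt) {
  return USER_AGENTS[attempt % USER_AGENTS.length];
}

// Exponential backoff: 2s, 4s, 8s, ... capped at 60s. The numbers are arbitrary.
function backoffDelayMs(attempt, baseMs = 2000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call fetchOnce(url, userAgent) until it reports success or attempts run out.
// Returns the data of the first successful attempt, or null if all fail.
async function scrapeWithRetry(url, fetchOnce, maxAttempts = 4, delayFn = backoffDelayMs) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchOnce(url, userAgentForAttempt(attempt));
    if (result.ok) return result.data;
    // Wait before the next attempt, but not after the last one.
    if (attempt < maxAttempts - 1) await sleep(delayFn(attempt));
  }
  return null;
}
```

Sharing the work across nodes is left out of the sketch; each node would run the same loop over its own share of the links.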

So for this project, it seems that using a 3rd-party paid service like brightdata or smartly working around the problem are the two possible options. I'm still not sure how to fetch the entire metadata for Amazon more or less reliably. This might be a problem, possibly for others as well. Too bad they decide not to return it once they figure out that the request is legit. I wonder if they added the Twitter and Facebook ASNs so that social previews can be generated when sharing to these networks?
