Issue scraping Amazon
Hello again!
unfurl.js returned
{
"title": "Sorry! Something went wrong!",
"favicon": "https://www.amazon.com/favicon.ico"
}
on one occasion when scraping https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1
Now it works better and returns correct metadata.
My question is: did all of that come from Amazon? I don’t think that title is something that is coming from unfurl or any associated libraries… so if that’s the case, nothing can be done except to retry, right?
The problem is that I cannot know which requests to retry (automatically)… is there a good way to detect such cases, perhaps based on the HTTP status code coming from the server?
I’d like to check for that and not save any metadata like this example above to the database.
thank you
UPDATE:
Amazon returns a 503 with HTML that contains the above title “Sorry! …”
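Since the failure shows up as an HTTP 503 (the “Sorry!” page), one way to avoid saving bad metadata is to check the status code before trusting the result. A minimal sketch of the idea, assuming Node 18+ (global fetch) and the unfurl.js API; fetchMetadataSafely is just an illustrative helper name, not part of any library:

const { unfurl } = require('unfurl.js');

// Probe the URL first and only run unfurl (and save the result) when the
// server answers with a 2xx status. This does mean two requests per URL.
async function fetchMetadataSafely(url) {
  const probe = await fetch(url, { redirect: 'follow' });
  if (!probe.ok) {
    // Amazon's 503 "Sorry! Something went wrong!" page ends up here
    return { ok: false, status: probe.status };
  }
  const metadata = await unfurl(url);
  return { ok: true, status: probe.status, metadata };
}

The probe and the unfurl request can still be treated differently by Amazon, so keeping a check on the returned title (e.g. rejecting “Sorry! Something went wrong!”) as a second line of defence before saving wouldn’t hurt.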
Here it is: https://gist.github.com/davidhq/dc097bf6eeeaefee47443cdf5dde9cfa
but it will probably behave differently from your IP (?)
I saved the .html retrieved: https://uniqpath.com/temp/result_unfurl_amazon_test.html
and it doesn’t seem to contain open_graph metadata, so nothing to be done probably…
You can try running the script with the other user agent (mimicking the Google spider) and you will either get a non-200 (error) or, at first, the full response.
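For reference, the user-agent experiment can be reproduced with a plain fetch (Node 18+), checking whether any Open Graph tags come back at all. The two UA strings are the ones from the update below; the results will of course vary by IP:

// Fetch the page with a given User-Agent and report whether the HTML
// contains any <meta property="og:..."> tags.
async function hasOpenGraph(url, userAgent) {
  const res = await fetch(url, { headers: { 'User-Agent': userAgent } });
  const html = res.ok ? await res.text() : '';
  return { status: res.status, hasOg: /<meta[^>]+property=["']og:/i.test(html) };
}

const googlebotUA = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
const chromeUA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36';

// hasOpenGraph('https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1', googlebotUA).then(console.log);
// hasOpenGraph('https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1', chromeUA).then(console.log);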
Quick update: changing the user agent from
'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
to
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
makes Amazon return 200 again from my local IP (other services as well), BUT Amazon does not return the complete data even with 200: open_graph and twitter_card are missing.
In any case, with some permutation of user agents, slower scraping, and sharing the work across a network, there might be a chance that scraping thousands of links per node (over time) can mostly succeed.
So for this project it seems that using a 3rd-party paid service like brightdata, or working around it smartly, are the two possible options… still not sure how to more or less reliably fetch the entire metadata for Amazon; this might be a problem… possibly for others as well… too bad they decide not to return it once they figure out that the request is legit. I wonder if they added the Twitter and Facebook ASNs to an allowlist so that social previews can be generated for sharing to these networks?