[BUG] After scraping around 800 hashtags Instamancer reloads the browser

Describe the bug: When scraping hashtags, Instamancer has recently started failing after around 800 posts (this is fairly consistent). On reaching that point, Instamancer restarts the browser and tries again from scratch.

It seems to be related to this line of code: https://github.com/ScriptSmith/instamancer/blob/07e664ea6b144f6d304c4c2cc2f7e957f53fa4f7/src/api/instagram.ts#L419

Specifically, the call to this.start() is what causes the browser to reload.

Looking at the network logs in Chrome, I can see that one of the GraphQL requests returns an error around the 800-post mark. Every request after that one seems to work fine.

To Reproduce: Search for any hashtag, and make sure the limit is higher than 800.

Setup (please complete the following information):

OS: [e.g. MacOS Catalina]
Instamancer version [e.g. v3.0.1]

I will add more info here as I debug the issue further.

Issue Analytics

State:
Created 4 years ago
Comments: 6 (3 by maintainers)

Top GitHub Comments

Daniel-Griffiths commented, Nov 21, 2019

Sorry @ScriptSmith, I have not had a chance to try it. I will close this issue for now and reopen it if I can get any further info.

ScriptSmith commented, Nov 7, 2019

In my initial attempts to reproduce this, I am able to gather 1000 posts from a hashtag.

The restarting process you describe is what I call grafting, which allows Instamancer to perform long scraping jobs by restarting the browser in order to limit resource usage. You can read about it on the website:

Because using a browser consumes lots of memory in large scraping jobs, Instamancer employs a new scraping technique called grafting. It intercepts and saves the URL and headers of each request, and then after a certain number of interactions with the page it will restart the browser and navigate back to the same page. Once the page initiates the first request to the API, its URL and headers are swapped on-the-fly with the most recently saved ones. The scraping continues without incident because the response from the API is in the correct form despite being for the incorrect data.

and in the FAQ:

What happens if I disable grafting?

Chrome / Chromium will eventually decide that it doesn’t want the page to consume any more resources and future requests to the API will be aborted. This usually happens between 5k-10k posts regardless of the memory available on the system. There doesn’t seem to be any combination of Chrome flags to avoid this.

This bug could be caused by something going wrong during a graft, that is, when Instamancer swaps the request parameters on the fly after the browser has been restarted.
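For illustration, the grafting flow quoted above can be sketched without a browser. This is only a minimal sketch under stated assumptions, not Instamancer's actual implementation: `GraftingSession`, `SavedRequest`, `graftLimit`, and the example.com URLs are hypothetical names made up for the example. It models just the bookkeeping: save each request, restart after a set number of interactions, then swap the first request after the restart for the most recently saved one.

```typescript
// Hypothetical model of grafting; none of these names come from Instamancer.
interface SavedRequest {
    url: string;
    headers: Record<string, string>;
}

class GraftingSession {
    public restarts = 0;
    private readonly graftLimit: number;
    private lastRequest: SavedRequest | undefined = undefined;
    private interactions = 0;
    private pendingGraft = false;

    constructor(graftLimit: number) {
        this.graftLimit = graftLimit;
    }

    // Called for every outgoing API request; returns the request to send.
    intercept(request: SavedRequest): SavedRequest {
        if (this.pendingGraft && this.lastRequest !== undefined) {
            // First request after a restart: send the saved URL and headers
            // instead, so scraping resumes where it left off.
            this.pendingGraft = false;
            return this.lastRequest;
        }
        // Normal case: remember this request and let it through unchanged.
        this.lastRequest = request;
        return request;
    }

    // Called after each interaction with the page (for example, a scroll).
    interact(): void {
        this.interactions += 1;
        if (this.interactions >= this.graftLimit) {
            // Stand-in for restarting the browser and reloading the page.
            this.interactions = 0;
            this.restarts += 1;
            this.pendingGraft = true;
        }
    }
}

// Usage: restart after two interactions, then graft the saved request.
const session = new GraftingSession(2);
session.intercept({ url: "https://example.com/api?cursor=a", headers: { "x-token": "1" } });
session.interact();
session.intercept({ url: "https://example.com/api?cursor=b", headers: { "x-token": "2" } });
session.interact(); // limit reached: simulated browser restart
const grafted = session.intercept({
    url: "https://example.com/api?cursor=start",
    headers: { "x-token": "3" },
});
console.log(grafted.url);
```

In this model, if no request had been saved before the restart, the fresh request would pass through unchanged and the scrape would begin again from the first page. That is one way a failed graft could look like starting from scratch, though whether it matches the behaviour reported here is an open question.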

So, a few questions: