
Can't scrape Instagram.com?


Subject of the issue

Using a selector copied from Chrome’s dev tools to scrape data from Instagram’s website yields nothing.

Your environment

version of node: v9.11.2
version of npm: 5.6.0

Steps to reproduce

Copy the selector of an element in Chrome’s dev tools and pass it to x-ray:

const XRay = require('x-ray');
const x = XRay();

function getInstagramFollowers(username) {
  let url = `https://www.instagram.com/${username}/`;
  let selector = '#react-root > section > main > div > ul > li:nth-child(2) > span > span';

  x(url, selector)((err, count) => {
    console.log('COUNT', count);
  });
}

getInstagramFollowers('facebook');

Expected behaviour

Get the number of followers back

Actual behaviour

count is an empty string

If you set the selector to just ‘html’, you get back what appears to be the raw JavaScript that is served before React renders the page.



levibuzolic commented, Jun 28, 2018

@MiLeung instagram.com is a client-side React app, which means the HTML isn’t present in the response that comes back from the server. Instead, the HTML is constructed in the browser using JavaScript. If you take a look at the HTML source of the site, you’ll see the only plain HTML inside the <body> is <span id="react-root"></span>.

This means you’ll need to use a driver that understands JavaScript. You could try x-ray-phantom, which requires PhantomJS to be installed on your computer or server.

However, the data you’re looking for (follower count) is still available elsewhere in the HTML source of the page.

For example, in the header you’ll find:

<meta property="og:description" content="3m Followers, 9 Following, 315 Posts - See Instagram photos and videos from Facebook (@facebook)" />

Which could be selected and parsed.
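As a rough sketch of that parsing step, the function below pulls the leading follower figure out of an og:description string shaped like the one above. The name `parseFollowerCount` and the regular expression are assumptions for illustration, and the abbreviated value is returned as text rather than as an exact number. With x-ray, the attribute could presumably be fetched with a selector along the lines of `meta[property="og:description"]@content`.

```javascript
// Hypothetical helper: extract the follower figure from an og:description
// string such as "3m Followers, 9 Following, 315 Posts - ...".
function parseFollowerCount(description) {
  // Capture the token immediately before the word "Followers".
  const match = /^\s*([\d.,]+[kmb]?)\s+Followers\b/i.exec(description);
  return match ? match[1] : null;
}

const sample = '3m Followers, 9 Following, 315 Posts - See Instagram photos and videos from Facebook (@facebook)';
console.log(parseFollowerCount(sample)); // 3m
```

The function returns null when the string does not start with a follower figure, so a change in Instagram's wording shows up as a missing value rather than an exception.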

However, an even better source of data would be the <script> tag in the <body>, which contains the initial data object used by their React app to render. We can use x-ray or just about anything to reach in and grab that data.

const XRay = require('x-ray');
const x = XRay();

const url = 'https://instagram.com/facebook';

x(url, 'body script@html').then(res => {
  // First strip variable declaration
  res = res.replace('window._sharedData = ', '');

  // Next strip the trailing semi-colon as that's not valid JSON
  res = res.replace(/;$/, '');

  // Now we parse the string as JSON
  const data = JSON.parse(res);

  // Now we deeply select the user object from the data
  const user = data.entry_data.ProfilePage[0].graphql.user;

  // And console log just the follower count
  // however there's heaps of useful data in the user object
  console.log(user.edge_followed_by.count);
});

You can see a working online example here: https://repl.it/@levibuzolic/x-ray-instagram-followers

This whole approach is, of course, pretty brittle. Like scraping any website, it relies on Instagram not changing their HTML or JS data structure. Instagram has an API you could use instead, and there are 3rd party sites and tools that will get this data for you and take on the burden of keeping their service working.
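One way to soften that brittleness is to treat every step of the extraction as something that may fail. The sketch below assumes the script text still has the `window._sharedData = {...};` shape used in the example above. The helper name `extractFollowerCount` is hypothetical, and optional chaining needs Node 14 or later.

```javascript
// Hypothetical helper: returns the follower count, or null when the page
// no longer matches the structure this scraper expects.
function extractFollowerCount(scriptText) {
  const prefix = 'window._sharedData = ';
  const text = typeof scriptText === 'string' ? scriptText.trim() : '';
  if (!text.startsWith(prefix)) {
    return null; // the script tag is missing or has changed shape
  }

  let data;
  try {
    // Drop the variable assignment and the trailing semicolon, then parse.
    data = JSON.parse(text.slice(prefix.length).replace(/;\s*$/, ''));
  } catch (err) {
    return null; // the payload is no longer valid JSON
  }

  // Optional chaining yields undefined instead of throwing on a missing key.
  const count = data?.entry_data?.ProfilePage?.[0]?.graphql?.user?.edge_followed_by?.count;
  return typeof count === 'number' ? count : null;
}
```

Inside the `.then` callback this could be called as `extractFollowerCount(res)`, with a null result signalling that the scraper needs updating.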


levibuzolic commented, Jun 29, 2018

@MiLeung while this isn’t specific to React, you should be able to tell you’re dealing with a client-side app by comparing the HTML that comes back in the response (view source) with the HTML that’s present after JS has run (inspect element).
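That comparison can be roughly automated. The sketch below is a heuristic under stated assumptions, not a reliable detector: it strips scripts, styles, and tags from the raw server HTML and treats a nearly empty body as a sign that the page is rendered client-side. The function name `looksClientRendered` and the 50-character threshold are arbitrary choices for illustration.

```javascript
// Heuristic sketch: does the server-sent HTML contain almost no visible text?
function looksClientRendered(rawHtml, minVisibleChars = 50) {
  // Keep only the body, if one is present.
  const bodyMatch = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(rawHtml);
  const body = bodyMatch ? bodyMatch[1] : rawHtml;

  const visibleText = body
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1>/gi, '') // drop non-visible blocks
    .replace(/<[^>]+>/g, ' ') // drop the remaining tags
    .replace(/\s+/g, ' ') // collapse whitespace
    .trim();

  return visibleText.length < minVisibleChars;
}

const reactShell = '<html><body><span id="react-root"></span><script>window._sharedData = {};</script></body></html>';
console.log(looksClientRendered(reactShell)); // true
```

A page that ships its content in the server response would leave plenty of visible text after stripping and return false.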