Is there a hard limit on maximum number of pages that Gatsby can build?


I’m trying to build a site with ~150k pages (probably more by the time I finish) using Gatsby with a CSV file as the data source. I started with a sample dataset of about 100 rows in a CSV file, developed my initial pages against it, and it worked. When I ran gatsby build with all 150k rows, the build got stuck in the “source and transform nodes” step.

As suggested by @KyleAMathews, I split the large CSV into multiple files (with a varying number of rows depending on the data). The build now finishes “source and transform nodes” in about 100s, but then fails with a heap out of memory error.

I also tried running the create pages benchmark site with 125k pages, and it fails with the same error, although with 100k pages it builds the site in less than 2 minutes.

I tried to find the underlying issue myself. Following the page creation docs, I reached the pages reducer and found that it uses a JavaScript Map for the state.

I was wondering whether there’s a hard limit on the number of items that can be set in a Map. From this StackOverflow answer, it looks like a Map can hold only up to 2^24 (roughly 16.7 million) items. I’m not sure what else this Redux state holds, but if it stores only the pages, does 2^24 become a hard limit on the number of pages that Gatsby can build?

Map is used in a lot of places in the Gatsby source code. Could one of them be causing this out of memory error?
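
As a quick sanity check on the arithmetic in the question: 2^24 is 16,777,216 (about 16.7 million), not 167k, and a Map holds 200,000 entries without trouble. The following is a minimal sketch in plain Node.js, independent of Gatsby; the `/page/N` keys and values are made up for illustration.

```javascript
// 2^24, the figure quoted from the StackOverflow answer.
const quotedLimit = 2 ** 24;

// Fill a Map with more entries than the ~150k pages in question.
// The keys and values are placeholders, not Gatsby's page objects.
const pages = new Map();
for (let i = 0; i < 200000; i++) {
  pages.set(`/page/${i}`, { path: `/page/${i}` });
}

console.log(quotedLimit); // 16777216
console.log(pages.size); // 200000
```

So the number of Map entries by itself is unlikely to explain a failure at ~125k pages; overall heap usage is a more plausible suspect.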

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments:23 (23 by maintainers)

Top GitHub Comments

17 reactions
pvdz commented, Jan 14, 2020

You might think I have forgotten about this.

But you’d be wrong.

And happy.

Debugging the problem in this build turned out to be a deep rabbit hole, and it took me some time to get in and back out of it. But I’m happy to report I can build your site in ~10 minutes now.

success Building production JavaScript and CSS bundles - 6.687s
success Rewriting compilation hashes - 0.007s
success run queries - 239.751s - 145178/145178 605.54/s
success Building static HTML for pages - 113.521s - 145178/145178 1278.87/s
info Done building in 654.383649061 sec

You’ll have to wait a bit before you can do this yourself, but some fixes and workarounds are upcoming.

The gist is that the way nodes are looked up has a shortcut for querying by id. Unfortunately, this heuristic is not optimal and misses the mark in your case. That led to a bunch of other things and will need to be fixed on Gatsby’s side.
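
To illustrate why querying by id can be so much cheaper than filtering on another field, here is a rough sketch. This is not Gatsby’s actual implementation: `nodesById`, `findById` and `findBySlug` are hypothetical names, and the sketch only assumes that nodes live in a Map keyed by id.

```javascript
// Hypothetical node store keyed by id (an assumption for illustration).
const nodesById = new Map();
for (let i = 0; i < 1000; i++) {
  nodesById.set(`id-${i}`, { id: `id-${i}`, fields: { slug: `/ifsc/${i}` } });
}

// Fast path: a single Map lookup, independent of how many nodes exist.
function findById(id) {
  return nodesById.get(id);
}

// Slow path: scan the nodes until one matches the filter.
// Doing this once per page makes the total work grow quadratically.
function findBySlug(slug) {
  for (const node of nodesById.values()) {
    if (node.fields.slug === slug) return node;
  }
  return undefined;
}

console.log(findById("id-42") === findBySlug("/ifsc/42")); // true
```

With ~150k pages, a per-page scan over ~150k nodes adds up to billions of comparisons, which is the kind of cost a direct id lookup avoids.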

After that, the run queries step drops to ~10 minutes (down from 257 minutes, or about 4.3 hours, as you can see above), which makes me very happy :D

What you’re waiting on now is for me to polish this fix and make sure the generic assumptions hold (is your site a one-off, or are most sites like yours?). Then we should be good to go.

6 reactions
pvdz commented, Jan 15, 2020

https://github.com/gatsbyjs/gatsby/pull/20609 has now landed in master. This is the part on our side that you’ll need in order to see improvements. (It still needs to be published, so if you’re not comfortable building from source, it usually doesn’t take long for a release to go out.)

The other change is to your repo: switch the lookup from slug to id.

src/templates/ifsc.tsx:

export const query = graphql`
-  query($slug: String!) {
-    allIfscCsv(filter: { fields: { slug: { eq: $slug } } }) {
+  query($id: String!) {
+    allIfscCsv(filter: { id: { eq: $id } }) {
       edges {
         node {
           ifsc

gatsby-node.js:

const result = await graphql(`
     query {
       allIfscCsv {
         edges {
           node {
+            id
             fields {
               slug
             }

and later in that file

   result.data.allIfscCsv.edges.forEach(({ node }) => {
     createPage({
       path: node.fields.slug,
       component: path.resolve(`./src/templates/ifsc.tsx`),
       context: {
-        slug: node.fields.slug
+        id: node.id,
       }
     });
   });

I think that should suffice.

With that, the run queries step should take roughly 5 minutes on Gatsby master.
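
Put together, the createPages side of this change could be factored as below. This is only a sketch: `buildPageInputs` is a hypothetical helper (not part of Gatsby), and the edge shape is assumed to match the allIfscCsv query shown above.

```javascript
// Hypothetical helper: turn query edges into createPage arguments,
// passing the node id (not the slug) as page context.
function buildPageInputs(edges, component) {
  return edges.map(({ node }) => ({
    path: node.fields.slug,
    component,
    context: { id: node.id },
  }));
}

// Example with made-up data in the assumed shape.
const sampleEdges = [
  { node: { id: "a1", fields: { slug: "/ifsc/one" } } },
  { node: { id: "b2", fields: { slug: "/ifsc/two" } } },
];
const inputs = buildPageInputs(sampleEdges, "./src/templates/ifsc.tsx");
console.log(inputs[0].context); // { id: 'a1' }
```

In gatsby-node.js, each entry would then be handed to createPage, with the component path resolved through path.resolve as in the diff above.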


If you want progress stats for your pages while building (hey, that’s 60 seconds less of staring at an idle screen), you can copy-paste my whole change, which adds a progress bar to the createPages step (this is gatsby-node.js again):

-exports.createPages = async ({ graphql, actions }) => {
+exports.createPages = async ({ graphql, actions, reporter }) => {
+  const progress = reporter.createProgress(`ifsc/gatsby-node.js`);
+  console.time("(ifsc) total exports.createPages");
+  console.time("(ifsc) initial graphql query");
+  progress.setStatus("initial graphql query");
+
   const { createPage } = actions;
   const result = await graphql(`
     query {
       allIfscCsv {
         edges {
           node {
+            id  
             fields {
               slug
             }
@@ -36,13 +42,38 @@ exports.createPages = async ({ graphql, actions }) => {
       }
     }
   `);
+  console.timeEnd("(ifsc) initial graphql query");
+
+  console.time("(ifsc) created pages");
+
+  progress.start();
+  progress.total = result.data.allIfscCsv.edges.length - 1;
+  let start = Date.now();
+  progress.setStatus(
+    "Calling createPage for " + result.data.allIfscCsv.edges.length + " pages"
+  );
   result.data.allIfscCsv.edges.forEach(({ node }) => {
     createPage({
       path: node.fields.slug,
       component: path.resolve(`./src/templates/ifsc.tsx`),
       context: {
-        slug: node.fields.slug
+        id: node.id,
+        // slug: node.fields.slug
       }
     });
+    progress.tick(1);
   });
+  progress.setStatus(
+    "Called createPage for " +
+      (result.data.allIfscCsv.edges.length - 1) +
+      " pages at " +
+      (result.data.allIfscCsv.edges.length - 1) /
+        ((Date.now() - start) / 1000) +
+      " pages/s"
+  );
+  progress.done();
+  console.timeEnd("(ifsc) created pages");
+  console.timeEnd("(ifsc) total exports.createPages");
+  progress.setStatus("createPages finished");
 };
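
The pages/s figure in the final status message above is concatenated as an unrounded float. One possible tidy-up is sketched below; `formatThroughput` is a hypothetical helper, not something the Gatsby reporter provides.

```javascript
// Hypothetical helper: format "N pages at X pages/s" with two decimals.
function formatThroughput(pageCount, elapsedMs) {
  // Clamp the elapsed time to at least 1 ms to avoid dividing by zero.
  const seconds = Math.max(elapsedMs, 1) / 1000;
  const rate = (pageCount / seconds).toFixed(2);
  return `${pageCount} pages at ${rate} pages/s`;
}

// Using the run queries numbers from the build log earlier in the thread.
console.log(formatThroughput(145178, 239751));
// 145178 pages at 605.54 pages/s
```

The result could be passed to progress.setStatus in place of the string concatenation.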