
createQueryStream loads a big set of data into memory before streaming

bigQuery.createQueryStream seems to load the entire result set into memory before the stream actually starts piping data to the downstream streams.

Environment details

  • OS: macOS 12.1
  • Node.js version: 14.18.1
  • npm version: 6.14.15
  • @google-cloud/bigquery version: 5.10.0

Steps to reproduce

Using the test script below, I can see over 300 MB of data loaded into memory before the stream starts piping to the downstream streams. Since I am selecting only a single column, that corresponds to a very large number of rows.

If I log each entry in the transform stream, the data also appears to arrive in batches: the stream pauses for a while and then suddenly starts piping again. This makes me think a whole page is loaded into memory internally and then pushed to the readable stream, but that might not be the actual issue.

// Reproduction script: pipe a large query result through a Transform into
// /dev/null and compare heap usage before and after the first row arrives.
const fs = require("fs");
const { Transform } = require("stream");
const { BigQuery } = require("@google-cloud/bigquery");

const bigQuery = new BigQuery();

const stream = bigQuery
  .dataset("dataset")
  .createQueryStream("SELECT email FROM table");

let checked = false;
const tr = new Transform({
  objectMode: true,
  transform: (chunk, enc, next) => {
    if (!checked) {
      // Log heap growth once, when the first row reaches the transform.
      console.dir("START PIPING");
      console.dir(process.memoryUsage());
      console.dir(
        "DIFFERENCE = " +
          (process.memoryUsage().heapUsed - heapUsed) / (1024 * 1024) +
          " MB"
      );
      checked = true;
    }
    next(null, JSON.stringify(chunk) + "\n");
  },
});

const write = fs.createWriteStream("/dev/null");

// Heap baseline, taken just before the pipeline starts flowing.
console.dir("BEFORE");
console.dir(process.memoryUsage());
const { heapUsed } = process.memoryUsage();

tr.pipe(write);
stream.pipe(tr);

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
rohit-gohri commented, May 17, 2022

Is there a way to limit the size of this chunk of data? It is causing an OOM exception for us on some tables. Can we use maxResults for that, or will it limit the total results?
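
A minimal sketch of that experiment, assuming createQueryStream accepts the same options object as query (the table name and maxResults value below are placeholders); whether maxResults caps each fetched page or the total row count is exactly the open question here:

const { BigQuery } = require("@google-cloud/bigquery");

const bigQuery = new BigQuery();

// Pass maxResults through the query options object instead of a bare SQL
// string. Whether this limits the rows fetched per request or the total
// number of rows returned is the question raised above.
const stream = bigQuery.createQueryStream({
  query: "SELECT email FROM table",
  maxResults: 10000, // placeholder limit
});

stream
  .on("data", (row) => {
    // handle one row at a time
  })
  .on("end", () => console.log("done"));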

0 reactions
loferris commented, Aug 10, 2022

The actual createQueryStream method doesn’t expose this parameter, because it packages together query creation and the streamify API. One option would be to end the stream after a certain amount of data has been emitted and make separate calls to the method until all data has been read (see an example). For more flexibility, you can also write the logic by hand using the various query and results methods together with streaming. Feel free to open a follow-up issue if a sample would be helpful! As for incorporating more configurability for streamified queries into the library itself, that will depend on user requests.
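
A minimal sketch of that hand-rolled approach, assuming the standard createQueryJob and job.getQueryResults APIs with autoPaginate disabled; the page size, query text, and the async-generator-plus-Readable wrapper are illustrative choices, not the example referenced above:

const { Readable } = require("stream");
const { BigQuery } = require("@google-cloud/bigquery");

const bigQuery = new BigQuery();

// Fetch results one page at a time and yield each row, so only a single page
// of rows needs to be held in memory at once.
async function* queryRows(sql, pageSize) {
  const [job] = await bigQuery.createQueryJob({ query: sql });
  let pageToken;
  do {
    const [rows, nextQuery] = await job.getQueryResults({
      autoPaginate: false,
      maxResults: pageSize,
      pageToken,
    });
    for (const row of rows) yield row;
    pageToken = nextQuery && nextQuery.pageToken;
  } while (pageToken);
}

// Readable.from turns the async generator into an object-mode stream that can
// be piped like the result of createQueryStream.
const rowStream = Readable.from(queryRows("SELECT email FROM table", 10000));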

