
createQueryStream loads a big set of data into memory before streaming

bigQuery.createQueryStream seems to load the entire result set into memory before the stream actually starts piping data to the downstream streams.

Environment details

  • OS: macOS 12.1
  • Node.js version: 14.18.1
  • npm version: 6.14.15
  • @google-cloud/bigquery version: 5.10.0

Steps to reproduce

Using the test script below, I can see over 300 MB of data loaded into memory before the stream starts piping to the downstream streams. Since I am selecting only a single column, that corresponds to a very large number of rows.

If I log each entry in the transform stream, the data also appears to arrive in batches: the stream pauses for a while and then suddenly starts piping again. This makes me think a whole page is loaded into memory internally and then pushed to the readable stream, but that might not be the actual issue.

// Reproduction script: pipe a large query result through a Transform into
// /dev/null and compare heap usage before and after the first row arrives.
const fs = require("fs");
const { Transform } = require("stream");
const { BigQuery } = require("@google-cloud/bigquery");

const bigQuery = new BigQuery();

const stream = bigQuery
  .dataset("dataset")
  .createQueryStream("SELECT email FROM table");

let checked = false;
const tr = new Transform({
  objectMode: true,
  transform: (chunk, enc, next) => {
    if (!checked) {
      // Log heap growth once, when the first row reaches the transform.
      console.dir("START PIPING");
      console.dir(process.memoryUsage());
      console.dir(
        "DIFFERENCE = " +
          (process.memoryUsage().heapUsed - heapUsed) / (1024 * 1024) +
          " MB"
      );
      checked = true;
    }
    next(null, JSON.stringify(chunk) + "\n");
  },
});

const write = fs.createWriteStream("/dev/null");

// Heap baseline, taken just before the pipeline starts flowing.
console.dir("BEFORE");
console.dir(process.memoryUsage());
const { heapUsed } = process.memoryUsage();

tr.pipe(write);
stream.pipe(tr);

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
rohit-gohri commented, May 17, 2022

Is there a way to limit the size of this chunk of data? It is causing an OOM exception for us on some tables. Can we use maxResults for that, or will it limit the total results?
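
A minimal sketch of that experiment, assuming createQueryStream accepts the same options object as query (the table name and maxResults value below are placeholders); whether maxResults caps each fetched page or the total row count is exactly the open question here:

const { BigQuery } = require("@google-cloud/bigquery");

const bigQuery = new BigQuery();

// Pass maxResults through the query options object instead of a bare SQL
// string. Whether this limits the rows fetched per request or the total
// number of rows returned is the question raised above.
const stream = bigQuery.createQueryStream({
  query: "SELECT email FROM table",
  maxResults: 10000, // placeholder limit
});

stream
  .on("data", (row) => {
    // handle one row at a time
  })
  .on("end", () => console.log("done"));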

0 reactions
loferris commented, Aug 10, 2022

The actual createQueryStream method doesn’t expose this parameter, because it packages together query creation and the streamify API. One option would be to end the stream after a certain amount of data has been emitted and make separate calls to the method until all data has been read (see an example). For more flexibility, you can also write the logic by hand using the various query and results methods together with streaming. Feel free to open a follow-up issue if a sample would be helpful! As for incorporating more configurability for streamified queries into the library itself, that will depend on user requests.
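
A minimal sketch of that hand-rolled approach, assuming the standard createQueryJob and job.getQueryResults APIs with autoPaginate disabled; the page size, query text, and the async-generator-plus-Readable wrapper are illustrative choices, not the example referenced above:

const { Readable } = require("stream");
const { BigQuery } = require("@google-cloud/bigquery");

const bigQuery = new BigQuery();

// Fetch results one page at a time and yield each row, so only a single page
// of rows needs to be held in memory at once.
async function* queryRows(sql, pageSize) {
  const [job] = await bigQuery.createQueryJob({ query: sql });
  let pageToken;
  do {
    const [rows, nextQuery] = await job.getQueryResults({
      autoPaginate: false,
      maxResults: pageSize,
      pageToken,
    });
    for (const row of rows) yield row;
    pageToken = nextQuery && nextQuery.pageToken;
  } while (pageToken);
}

// Readable.from turns the async generator into an object-mode stream that can
// be piped like the result of createQueryStream.
const rowStream = Readable.from(queryRows("SELECT email FROM table", 10000));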

