createQueryStream loads a big set of data into memory before streaming
The bigQuery.createQueryStream method seems to load an entire result set into memory before the stream starts actually piping data into the downstream streams.
Environment details
- OS: MacOS 12.1
- Node.js version: 14.18.1
- npm version: 6.14.15
- @google-cloud/bigquery version: 5.10.0
Steps to reproduce
Using the test script below, I can see over 300 MB of data loaded into memory before the stream starts piping to the next streams. And I am only selecting one column, so that is a lot of records.
If I log each entry in the transform stream, the data also seems to come in batches: it pauses for a while and then suddenly starts piping again. This makes me think a whole page is loaded into memory internally and then piped to the readable stream, but that might not be the issue.
const fs = require("fs");
const { Transform } = require("stream");
const { BigQuery } = require("@google-cloud/bigquery");

const bigQuery = new BigQuery();

// "dataset" and "table" are placeholders for the real dataset and table names.
const stream = bigQuery
  .dataset("dataset")
  .createQueryStream("SELECT email FROM table");

let checked = false;
const tr = new Transform({
  objectMode: true,
  transform: (chunk, enc, next) => {
    if (!checked) {
      // Log heap usage the moment the first row reaches this stream.
      console.dir("START PIPING");
      console.dir(process.memoryUsage());
      console.dir(
        "DIFFERENCE = " +
          (process.memoryUsage().heapUsed - heapUsed) / (1024 * 1024) +
          " MB"
      );
      checked = true;
    }
    next(null, JSON.stringify(chunk) + "\n");
  },
});

const write = fs.createWriteStream("/dev/null");

// Record heap usage before any piping starts.
console.dir("BEFORE");
console.dir(process.memoryUsage());
const { heapUsed } = process.memoryUsage();

tr.pipe(write);
stream.pipe(tr);
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 2
- Comments: 5 (3 by maintainers)
Is there a way to limit the size of this chunk of data? It is causing an OOM exception for us on some tables. Can we use maxResults for that, or will it limit the total results?

The actual createQueryStream method doesn't expose this parameter, as it packages both query creation and the use of the streamify API. One option would be to end the stream after a certain amount of data has been emitted and make separate calls of that method until all data has been read (see an example). For further flexibility, you can also write the logic by hand using the various query and results methods with streaming. Feel free to open a follow-up issue if a sample seems helpful! As for incorporating more configurability into the library itself for creating streamified queries, it'll depend on requests!
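A minimal sketch of that second, hand-written option, assuming the same placeholder dataset and table names as the test script above (the streamInPages helper name is made up for illustration). It runs the query as a job and pages through the results with job.getQueryResults, where maxResults caps the rows fetched per page rather than the total, so only one page has to sit in memory at a time.

const { BigQuery } = require("@google-cloud/bigquery");

async function streamInPages() {
  const bigQuery = new BigQuery();

  // Run the query as a job instead of using createQueryStream.
  const [job] = await bigQuery.createQueryJob({
    query: "SELECT email FROM table",
    defaultDataset: { datasetId: "dataset" },
  });

  // With autoPaginate disabled, each call returns one page of rows plus the
  // options (including the pageToken) for the next page, or null when there
  // are no more pages.
  let options = { maxResults: 10000, autoPaginate: false };
  while (options) {
    const [rows, nextOptions] = await job.getQueryResults(options);
    for (const row of rows) {
      process.stdout.write(JSON.stringify(row) + "\n");
    }
    options = nextOptions;
  }
}

streamInPages().catch(console.error);

In this sketch the per-page maxResults value is the knob that bounds memory: each page is processed and released before the next one is requested, so heap usage stays roughly proportional to the page size rather than the full result set.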