Which of these methods to limit how fast things get piped?

In the example below, I’m downloading a gzipped file that’s several hundred megabytes (millions of rows), then piping it through highland so I can use the batch feature. This way, instead of inserting 1 row into the db at a time (csvStream emits one ‘data’ event for one row), I can do it for 10000 rows at a time in the ‘data’ event handler you see below.

download.stream(url)
  .pipe(zlib.createGunzip())
  .pipe(csvStream)
  .pipe(highland())
  .batch(10000)
  .on('data', async (data) => {
    await insertIntoDB(data)
  })

But when running this I get out of memory errors and the system starts to slow down significantly. I think it’s because the data is coming in too fast to the ‘data’ event. The csvStream’s finish event happens in a couple of minutes but the program runs for up to another hour, which indicates that the whole csv file has been read into memory, rather than being piped downstream piece by piece as the data event consumes the batches.

I’m new to highland and looking through the documentation I can’t tell which of the various methods would be the most appropriate in this case. http://highlandjs.org/#backpressure seems like it’s most relevant to this situation but I can’t tell how to use it in this code. http://highlandjs.org/#parallel looks good too.

Can I configure highland so that at any time only, for example, 3 batches (of 10000 rows each) have been read, and it reads another 10000 rows only when one of those 3 batches is complete?

Issue Analytics

  • State: open
  • Created 7 years ago
  • Comments: 25 (1 by maintainers)

Top GitHub Comments

4 reactions
vqvu commented, Oct 26, 2016

I’m not super familiar with the async functions spec, but I think the code you’ve posted will just execute as many insertIntoDB as necessary to get through your input data, all in parallel. The fact that you await inside your async function doesn’t actually cause it to block the data call. You’d get the same behavior if you passed insertIntoDB to on('data') directly.
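
To make that point concrete, here is a minimal sketch using only Node's built-in EventEmitter rather than Highland or the real csvStream. The names runDemo and fakeInsert are hypothetical stand-ins (fakeInsert plays the role of insertIntoDB and just waits 10 ms). It counts how many inserts are in flight at once when the 'data' handler is an async function.

```javascript
const { EventEmitter } = require('node:events');

// Emits `count` 'data' events and resolves with the highest number of
// inserts that were in flight at the same time.
function runDemo(count) {
  let inFlight = 0;
  let maxInFlight = 0;
  const pending = [];

  // Hypothetical stand-in for insertIntoDB: resolves after 10 ms.
  function fakeInsert(batch) {
    inFlight += 1;
    maxInFlight = Math.max(maxInFlight, inFlight);
    const p = new Promise((resolve) => {
      setTimeout(() => {
        inFlight -= 1;
        resolve(batch);
      }, 10);
    });
    pending.push(p);
    return p;
  }

  const emitter = new EventEmitter();

  // Same shape as the handler in the question. emit() ignores the promise
  // returned by an async listener, so the await only pauses this one call.
  emitter.on('data', async (batch) => {
    await fakeInsert(batch);
  });

  // The emitter keeps firing without waiting for earlier inserts.
  for (let i = 0; i < count; i += 1) {
    emitter.emit('data', [i]);
  }

  return Promise.all(pending).then(() => maxInFlight);
}

runDemo(5).then((max) => console.log(max)); // prints 5
```

All five inserts start before any of them finishes, which is the unbounded parallelism described above.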

You’re right that backpressure is what you want though. In Highland, you get it for free as long as you don’t opt out. Using on('data') is opting out. You shouldn’t really need to use the on method, ever. That’s why we don’t document the events on the website.

To do what you want, you need to use parallel or mergeWithLimit. They’re more or less the same, except that parallel preserves order and mergeWithLimit doesn’t. You use them by constructing a stream of streams. The usual approach is to use map to perform some asynchronous action and return a stream that waits for that action to complete and then emits the result. The stream here functions much like a promise. It doesn’t contain any data; it’s just a handle for getting at your data later. Then you call parallel or mergeWithLimit, which will handle the backpressure for you.

For example,

download.stream(url)
  .pipe(zlib.createGunzip())
  .pipe(csvStream)
  .pipe(highland())
  .batch(10000)

  // The result of the compose is equivalent to this arrow function:
  //   data => highland(insertIntoDB(data))
  // For every object, which is 10000 rows, call insertIntoDB, which returns a promise,
  // then wrap the promise in a stream using the highland constructor. You now have a
  // stream of streams.
  .map(highland.compose(highland, insertIntoDB))

  // Merge the stream elements together by consuming them, making sure that only 3 are being
  // consumed at a time. You may, of course, replace 3 with whatever parallelism factor you
  // want.
  .mergeWithLimit(3)

  // Consume the results and execute the callback when done.
  .done(() => {
    console.log('I am done.');
  });

The reason this works is the laziness and backpressure features of Highland. Nothing happens until you call done, which is a consumption operator.

  1. done will ask for data from the mergeWithLimit stage.
  2. mergeWithLimit will ask for streams from the map stage, stopping once it has gotten 3 of them. Once it has finished consuming one of the three, it will ask for more. This is backpressure.
  3. map only executes when mergeWithLimit asks for data. This is laziness. Since mergeWithLimit will only ask for 3 at a time and won’t ask for more until a previous stream (i.e., task) has completed, you will only have 3 tasks in-flight at any one time.
  4. map of course exerts backpressure on the previous stages, which keeps memory usage from blowing up.
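
The pull-based flow in the steps above can also be sketched without Highland. This is not how Highland implements mergeWithLimit; it is a plain Node.js illustration of the same idea, assuming the batches arrive as an async iterable. The names consumeWithLimit and batches are hypothetical, and task stands in for insertIntoDB.

```javascript
// Pull batches from an async iterable, keeping at most `limit` tasks in flight.
async function consumeWithLimit(source, limit, task) {
  const iterator = source[Symbol.asyncIterator]();
  let exhausted = false;

  // Each worker asks for the next batch only after its previous task has
  // finished, so the source is never pulled more than `limit` batches ahead.
  async function worker() {
    while (!exhausted) {
      const { value, done } = await iterator.next();
      if (done) {
        exhausted = true;
        return;
      }
      await task(value);
    }
  }

  await Promise.all(Array.from({ length: limit }, () => worker()));
}

// Hypothetical lazy source: produces a batch only when one is requested.
async function* batches(total, size) {
  for (let start = 0; start < total; start += size) {
    const length = Math.min(size, total - start);
    yield Array.from({ length }, (_, i) => start + i);
  }
}
```

Calling consumeWithLimit(batches(1000000, 10000), 3, insertIntoDB) would keep at most 3 inserts running, and the generator would not build a new batch until a worker is free to ask for it.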

1 reaction
vqvu commented, Jan 13, 2017

download should return a promise that resolves once the download completes. Both of your options will work. Use the one that you like better.
