Hi,

I’m using elasticsearchJS to export a whole index from ES in batches of 4096. The whole tool uses about 500 MB of RAM while dumping the ES index to Parquet format (Node.js has a 2 GB memory limit set).

If I lower or increase the batch size (or sometimes seemingly at random), it uses a lot of memory, around 2-3 GB, and gets killed. The quickest way to reproduce is to increase the batch size that it has to process. The generated Parquet file is usually about 5.4 GB.

Is there anything I can do to debug this further?

Thanks!

P.S.: I’m using git+ssh://git@github.com/ironSource/parquetjs.git#1fa58b589d9b6451379f1558214e9ae751909596 as the parquetJS package.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
arnabguptadev commented, Jun 1, 2020

I think I hit this same issue. Will try to post a working sample to demonstrate, but here’s what I did:

  • Tried to write 20 rows.
  • Set row group sizes to different values: 3/5/7/50
  • Each time, the output size differs by a wide margin
  • It writes more rows than expected (validated in Athena using count(*))

Debugging through the library, it seems that if flushing happens only inside the close method (here: https://github.com/ironSource/parquetjs/blob/master/lib/writer.js#L108), everything comes out fine and you get the smallest output.

But if, due to your row group size, it is also triggered in append (here: https://github.com/ironSource/parquetjs/blob/master/lib/writer.js#L96), then you end up with duplicate rows. For large numbers of rows, that continues to build up until it exhausts memory.

I tried the following as a quick and dirty workaround and it seems to work: I changed the above lines in writer.js to:

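      // Keep a reference to the buffered rows and remember their count.
      let to_write = this.rowBuffer;
      let rowCount = this.rowBuffer.rowCount;
      this.rowBuffer.rowCount = 0;
      // Reset the buffer before awaiting, so later appends go into a fresh buffer.
      this.rowBuffer = {};
      to_write.rowCount = rowCount;
      // Only now hand the detached buffer to the writer.
      await this.envelopeWriter.writeRowGroup(to_write);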

With this, the row count does seem to stay correct.

I think having the await before resetting the buffer (this.rowBuffer = {};) is the issue.
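To illustrate the suspected race, here is a minimal sketch that does not use parquetjs at all. MockWriter, run, sleep, and the swapBeforeAwait flag are all hypothetical names made up for this example, and the buffer is a plain array rather than the library's real row buffer. It also assumes that appends can overlap with a flush that is still in flight (for example, because the caller does not await every append), which may or may not match the original setup.

```javascript
// Hypothetical stand-in for the real writer; none of these names come from parquetjs.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

class MockWriter {
  constructor(rowGroupSize, swapBeforeAwait) {
    this.rowGroupSize = rowGroupSize;
    this.swapBeforeAwait = swapBeforeAwait;
    this.rowBuffer = []; // rows waiting to be flushed
    this.written = 0;    // number of rows "written" so far
  }

  // Simulates slow I/O: yields to the event loop, then counts the rows it was given.
  async writeRowGroup(rows) {
    await sleep(5);
    this.written += rows.length;
  }

  async appendRow(row) {
    this.rowBuffer.push(row);
    if (this.rowBuffer.length < this.rowGroupSize) return;

    if (this.swapBeforeAwait) {
      // Workaround pattern: detach the buffer first, then await the write.
      const toWrite = this.rowBuffer;
      this.rowBuffer = [];
      await this.writeRowGroup(toWrite);
    } else {
      // Suspected bug pattern: the shared buffer is still live during the await,
      // so overlapping appends keep growing it and flush it again.
      await this.writeRowGroup(this.rowBuffer);
      this.rowBuffer = [];
    }
  }

  async close() {
    if (this.rowBuffer.length === 0) return;
    const toWrite = this.rowBuffer;
    this.rowBuffer = [];
    await this.writeRowGroup(toWrite);
  }
}

// Appends 20 rows with a row group size of 5 and returns how many rows were written.
async function run(swapBeforeAwait) {
  const writer = new MockWriter(5, swapBeforeAwait);
  const pending = [];
  // Assumption: appends are fired without awaiting each one, so they overlap with flushes.
  for (let i = 0; i < 20; i++) pending.push(writer.appendRow({ id: i }));
  await Promise.all(pending);
  await writer.close();
  return writer.written;
}
```

Calling run(true) should report exactly 20 rows written, while run(false) reports more than 20 because the same buffer gets flushed repeatedly. That is consistent with the duplicate rows described above, but it is only a model of the behavior, not a confirmation of what the library does internally.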

Does this sound right?

Regards, Arnab.

0 reactions
asmuth commented, Jun 16, 2020

Closing this as resolved.
