Consider making the ByteBuffer public

Hi,

I wanted to suggest making ByteBuffer.cs public instead of internal, and possibly allowing inheritance.

The context is that I have several TB of gzipped CSVs to turn into Parquet, and to save time and memory I skip allocating the UTF-8 bytes as strings entirely. There did not seem to be a good way to use the logical writers to submit a ReadOnlySpan<byte> for a string-type column, so I had to go for the physical writers and use ByteArray instead.

I basically had to build my own version of the ByteBuffer (I added some deduping / re-using of the same ByteArray to further save on memory), so I was thinking it would be lovely if I could have just extended your ByteBuffer class instead. It would certainly make it easier for others who just want to prepare a bunch of UTF-8 bytes / ReadOnlySpan<byte> to write to a column.

Cheers, and many thanks for the great library. It performs very well.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
Bio2hazard commented, Apr 3, 2020

Hi, thank you so much!

I was a bit under time pressure to finish my project and was unable to wait for the library update that made the ByteBuffer public. But as a way of giving back I wanted to share what I ended up doing.

I discovered that in .NET Core 3 they added an ArrayBufferWriter, which is a bit different from the ByteBuffer but solved my problem all the same. ByteBuffer seems like it could be an implementation of MemoryManager and behaves somewhat similarly to RecyclableMemoryStream.

ArrayBufferWriter instead maintains a single contiguous memory area. Growing the ArrayBufferWriter is expensive, as it needs to allocate the original size + 50% as an entirely new memory area and then copy all the existing data over in order to keep the memory contiguous. In my case, however, the data is very predictable, so I can specify an initial buffer size, and in 97% of all cases it does not need to grow.

Due to the amount of input data, it was a priority to ensure that each server could run as many containers of the ALB-to-Parquet converter as possible, so I had to significantly limit the memory available. To prevent OOM due to buffer resizes, I periodically check the available space in the ArrayBufferWriter and finish writing the row group early if necessary.
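
For anyone wanting to try the same idea, here is a minimal sketch of the capacity check. It is not the code from my project: `RowGroupBuffer` and `TryAppend` are hypothetical names made up for illustration, and the ParquetSharp write itself is left out. It only relies on `ArrayBufferWriter<byte>` members from `System.Buffers` (`FreeCapacity`, `GetSpan`, `Advance`, `Clear`).

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper for illustration; not part of ParquetSharp.
public static class RowGroupBuffer
{
    // Appends the bytes only if they fit in the already-allocated buffer.
    // Returns false instead of letting the writer grow, so the caller can
    // finish the current row group early and start over with Clear().
    public static bool TryAppend(ArrayBufferWriter<byte> writer, ReadOnlySpan<byte> value)
    {
        if (value.IsEmpty)
            return true; // nothing to copy, and no reason to request space

        if (value.Length > writer.FreeCapacity)
            return false; // appending would force a resize and a full copy

        value.CopyTo(writer.GetSpan(value.Length));
        writer.Advance(value.Length);
        return true;
    }

    public static void Main()
    {
        // Initial capacity chosen up front; 16 bytes is just a toy value.
        var writer = new ArrayBufferWriter<byte>(initialCapacity: 16);

        foreach (string field in new[] { "hello", "parquet", "overflow!" })
        {
            byte[] utf8 = Encoding.UTF8.GetBytes(field);
            if (!TryAppend(writer, utf8))
            {
                // In a real converter: write the buffered values as a row group here.
                Console.WriteLine($"Flushing {writer.WrittenCount} bytes before '{field}'");
                writer.Clear(); // resets the written count and keeps the same array
                TryAppend(writer, utf8);
            }
        }

        Console.WriteLine($"Buffered {writer.WrittenCount} bytes, {writer.FreeCapacity} free");
    }
}
```

The trade-off is that row groups end up with slightly uneven sizes, in exchange for a hard upper bound on the buffer's memory.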

Finally, I only pin the memory for each column just before writing it.

I chose that approach because in my testing and benchmarking, pinning ended up getting slower and slower, in particular when the app is already working under memory constraints.
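
A rough sketch of the "pin late" idea, again with made-up names (`LatePinning`, `WithPinned`) rather than my actual code: the buffer stays unpinned while it is being filled, and is pinned only for the duration of a callback that would hand the pointer to the physical writer. It assumes the writer's memory is array-backed, which is what `MemoryMarshal.TryGetArray` checks.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Hypothetical helper for illustration; not part of ParquetSharp.
public static class LatePinning
{
    // Pins the written region only while `write` runs, then unpins it again.
    public static T WithPinned<T>(ArrayBufferWriter<byte> writer, Func<IntPtr, int, T> write)
    {
        if (!MemoryMarshal.TryGetArray(writer.WrittenMemory, out ArraySegment<byte> segment))
            throw new InvalidOperationException("Buffer is not backed by an array.");

        GCHandle handle = GCHandle.Alloc(segment.Array, GCHandleType.Pinned);
        try
        {
            IntPtr start = handle.AddrOfPinnedObject() + segment.Offset;
            return write(start, segment.Count);
        }
        finally
        {
            handle.Free(); // the GC is free to move the array again
        }
    }

    public static void Main()
    {
        var writer = new ArrayBufferWriter<byte>(initialCapacity: 16);
        new byte[] { 10, 20, 30 }.CopyTo(writer.GetSpan(3));
        writer.Advance(3);

        // Stand-in for the column write: read one byte back through the pinned pointer.
        byte second = WithPinned(writer, (ptr, length) => Marshal.ReadByte(ptr, 1));
        Console.WriteLine($"Second byte via pinned pointer: {second}");
    }
}
```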

Anyhow, my use case is very specific and I would not recommend this approach unless you have both very predictable data and need to significantly optimize performance and memory usage.

All that said, it was a great success, in large part thanks to your library, so thank you! I was able to convert 14+ TB in roughly 8 hours, and the bottleneck was actually network read IO (if the data had been local, I estimate it would have been able to do the work in ~3 hours). Oh, and doing it manually like this cost under $10 for the cloud computing power. We did actually do a proof of concept with a Spark cluster, which would have taken ~5 days and cost about $10,000. GG 👍

1 reaction
GPSnoopy commented, Feb 20, 2020

I’ve merged the public change to master.

I wonder if the simple WriteBatch call fails because the schema node is marked as Repetition.Optional (this is the default for a Column<string>, as the type is nullable)? This would cause Parquet C++ to expect the defLevels and fail if they are not there.
