Consider making the ByteBuffer public
Hi,
I wanted to suggest making ByteBuffer.cs public instead of internal, and possibly allowing inheritance.
Context is that I have several TB of gzipped CSVs to turn into Parquet, and in order to save on time and memory, I skip allocating the UTF8 bytes as strings entirely. There did not seem to be a good way to use the logical writers to submit a `ReadOnlySpan<byte>` for a string-type column, so I had to go for the physical writers and use `ByteArray` instead.
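Roughly, the shape of what I ended up doing looks like the sketch below (simplified, not my actual code; the `ByteArray(IntPtr, int)` constructor and the `WriteBatch(numValues, defLevels, repLevels, values)` overload shape are from memory and may not match the current ParquetSharp API exactly):

```csharp
using System;
using ParquetSharp;

static class PhysicalUtf8Writer
{
    // utf8 holds all field bytes back to back; offsets/lengths locate each value in it.
    public static unsafe void Write(RowGroupWriter rowGroup, ReadOnlySpan<byte> utf8,
                                    int[] offsets, int[] lengths)
    {
        using var columnWriter = (ColumnWriter<ByteArray>) rowGroup.NextColumn();

        fixed (byte* basePtr = utf8)    // pin the UTF8 buffer while the pointers are in use
        {
            var values = new ByteArray[offsets.Length];
            for (var i = 0; i < values.Length; ++i)
            {
                // Each ByteArray just points into the pinned UTF8 buffer - no string allocation.
                values[i] = new ByteArray((IntPtr) (basePtr + offsets[i]), lengths[i]);
            }

            // Column<string> is Optional by default, so definition levels are needed;
            // 1 marks every value as present.
            var defLevels = new short[values.Length];
            Array.Fill(defLevels, (short) 1);

            columnWriter.WriteBatch(values.Length, defLevels, null, values);
        }
    }
}
```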
I basically had to build my own version of the ByteBuffer (I added some deduping / re-using the same `ByteArray` to further save on memory), and as such I was thinking it would be lovely if I could have just extended your `ByteBuffer` class instead. It would certainly make it easier for others who just want to prepare a bunch of UTF8 bytes / `ReadOnlySpan<byte>` to write to a column.
Cheers, and many thanks for the great library. It performs very well.
Hi, thank you so much!
I was a bit under time pressure to finish my project and was unable to wait for the library update that made the ByteBuffer public. But as a way of giving back I wanted to share what I ended up doing.
I discovered that .NET Core 3 added an ArrayBufferWriter, which is a bit different from the ByteBuffer but solved my problem all the same. ByteBuffer seems like it could be an implementation of MemoryManager and behaves somewhat similarly to RecyclableMemoryStream.
ArrayBufferWriter instead maintains a single contiguous memory area. Growing the ArrayBufferWriter is expensive, as it needs to allocate the original size + 50% as an entirely new memory area and then copy all the existing data over in order to keep the memory contiguous. In my case, however, the data is very predictable, so I can specify an initial buffer size, and in 97% of all cases it does not need to grow.
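To give a concrete picture, a simplified sketch of that pattern (illustrative names and sizes, not my actual code): size the buffer for the expected row group up front so it almost never has to grow, and copy the raw UTF8 field bytes straight into it with no intermediate strings.

```csharp
using System;
using System.Buffers;

static class Utf8Buffer
{
    // Sized so the vast majority of row groups fit without a resize (value is illustrative).
    public static ArrayBufferWriter<byte> Create(int expectedBytes = 64 * 1024 * 1024)
        => new ArrayBufferWriter<byte>(expectedBytes);

    // Append a field's UTF8 bytes and return its (offset, length), so a ByteArray can
    // later point at it without any further copying.
    public static (int Offset, int Length) Append(ArrayBufferWriter<byte> buffer, ReadOnlySpan<byte> utf8Field)
    {
        var offset = buffer.WrittenCount;
        utf8Field.CopyTo(buffer.GetSpan(utf8Field.Length));
        buffer.Advance(utf8Field.Length);
        return (offset, utf8Field.Length);
    }
}
```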
Due to the amount of input data, it was a priority to ensure that each server can run as many containers of the ALB-to-Parquet converter as possible, so I had to significantly limit the memory available to each one. To prevent OOM due to buffer resizes, I periodically check the available space in the ArrayBufferWriter and finish writing the row group early if necessary.
Finally, I only pin the memory for each column just before writing it.
I chose that approach because, in my testing and benchmarking, pinning ended up getting slower and slower, in particular when the app was already working under memory constraints.
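Put together, the row-group handling and the pinning look roughly like this (again a simplified sketch with illustrative names; the `WriteBatch` overload and `ByteArray(IntPtr, int)` constructor shapes are assumptions and may differ between ParquetSharp versions):

```csharp
using System;
using System.Buffers;
using ParquetSharp;

static class RowGroupFlush
{
    // Called between rows: if the next row could overflow the pre-sized buffer (forcing a
    // costly resize or an OOM), close the current row group early and start a fresh one.
    public static bool ShouldFlush(ArrayBufferWriter<byte> buffer, int worstCaseRowBytes)
        => buffer.FreeCapacity < worstCaseRowBytes;

    // Pin the buffer only while handing pointers to the physical writer, then unpin it.
    public static unsafe void WriteColumn(
        ColumnWriter<ByteArray> writer, ArrayBufferWriter<byte> buffer,
        (int Offset, int Length)[] fields, short[] defLevels)
    {
        using MemoryHandle pin = buffer.WrittenMemory.Pin();
        var basePtr = (byte*) pin.Pointer;

        var values = new ByteArray[fields.Length];
        for (var i = 0; i < values.Length; ++i)
            values[i] = new ByteArray((IntPtr) (basePtr + fields[i].Offset), fields[i].Length);

        writer.WriteBatch(values.Length, defLevels, null, values);
    }   // MemoryHandle disposed here - the buffer is unpinned again

    // After the row group is written, the buffer can be cleared and reused.
    public static void Reset(ArrayBufferWriter<byte> buffer) => buffer.Clear();
}
```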
Anyhow, my use case is very specific and I would not recommend this approach unless you have both very predictable data and need to significantly optimize performance and memory usage.
That all said, it was a great success, in large part thanks to your library, so thank you! I was able to convert 14+ TB in roughly 8 hours, and the bottleneck was actually network read IO (if the data had been local, I estimate it would have been able to do the work in ~3 hours). Oh, and doing it manually like this cost under $10 in cloud computing power. We did actually do a proof of concept with a Spark cluster, which would have taken ~5 days and cost about $10,000. GG 👍
I’ve merged the `public` change to master.

I wonder if the simple `WriteBatch` call fails because the schema node is marked as `Repetition.Optional` (this is the default for a `Column<string>`, as the type is nullable)? This would cause Parquet C++ to expect the `defLevels` and fail if it’s not there.
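If that is the cause, the difference would look roughly like this (sketch only; the exact overload shapes may differ between versions):

```csharp
using System;
using ParquetSharp;

static class OptionalColumnWrite
{
    public static void Write(ColumnWriter<ByteArray> columnWriter, ByteArray[] values)
    {
        // columnWriter.WriteBatch(values);          // likely rejected: no defLevels for an Optional column

        var defLevels = new short[values.Length];    // 1 = value present, 0 = null
        Array.Fill(defLevels, (short) 1);
        columnWriter.WriteBatch(values.Length, defLevels, null, values);
    }
}
```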