ParquetWriter - Byte[] Impacting File Size Outputs
Version: v3.9.1 -> v3.7.7 -> v3.5.3 -> v3.3.11 -> v3.1.4
Runtime Version: .NET 4.6.2, .NET 4.7.2, .NET 4.8, and .NET 6
OS: Windows 10
Expected behavior
A one-million-byte (1 MiB) array written to Parquet should produce a file of roughly 1 MB plus Parquet's overhead for the file structure and schema metadata, and perhaps smaller with column-based compression.
Actual behavior
A one-million-byte array (as in the example below) written to Parquet produces a file of around 9 MB. The inflation is linear across all sizes: 1 GB becomes ~9 GB, and so on. The same issue occurs with very large strings.
Steps to reproduce the behavior
- Create an array with a large amount of data.
- Create an appropriate DataField, Schema, and DataColumn over an array of byte[] (i.e. a byte[][]).
- Create a ParquetWriter with TreatByteArrayAsString = false and CompressionMethod set to Snappy or None (it makes no difference).
- Create a RowGroup.
- Write the one DataColumn.
- Observe the output file size is vastly larger than expected.
Additional Details
This example is using random bytes and thus will not compress very well.
I have also included a raw base64 conversion to a text file for comparison; it shows around a 33% increase over 1 MB, but that is expected from base64 encoding (every 3 bytes become 4 characters, so 1,048,576 bytes come out at roughly 1.4 MB of text).
As with my last issue, if this is a real bug and not just my Parquet ignorance, I'm happy to help out, but I would appreciate a pointer in the right direction. My gut feeling is that a buffer's allocated length is being written rather than the actual content length in the buffer.
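For a rough sense of what the file "should" cost: Parquet's PLAIN encoding stores a BYTE_ARRAY value as a 4-byte length prefix followed by the raw bytes, so a single 1 MiB value should only add a handful of bytes of per-value overhead on top of page and footer metadata. A back-of-the-envelope sketch (the constants here, including the metadata allowance, are illustrative assumptions, not anything measured from the library):

```csharp
using System;

// Rough expected size of one PLAIN-encoded BYTE_ARRAY value:
// 4-byte length prefix + the payload itself, plus file/page metadata.
const int payloadBytes = 1 * 1024 * 1024;   // the 1 MiB value from the repro
const int lengthPrefixBytes = 4;            // per-value prefix in PLAIN encoding
const int metadataAllowance = 16 * 1024;    // generous allowance for headers/footer (assumption)

long expectedUpperBound = payloadBytes + lengthPrefixBytes + metadataAllowance;
Console.WriteLine($"Expected around {expectedUpperBound:N0} bytes, observed around 9,000,000 bytes");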
Code
// Requires: using System; using System.IO; using System.Threading.Tasks;
//           using Parquet; using Parquet.Data;
public static async Task ParquetNet_FileSizeIssue_Reproducible_Async()
{
    var testStringFile = @".\test\test.txt";
    var testParquetFile = @".\test\test-000.parquet";

    // Start from a clean output directory.
    if (!Directory.Exists("test"))
    { Directory.CreateDirectory("test"); }
    if (File.Exists(testParquetFile))
    { File.Delete(testParquetFile); }
    if (File.Exists(testStringFile))
    { File.Delete(testStringFile); }

    // One row containing a single 1 MiB value of random (incompressible) bytes.
    var oneMillionBytes = 1 * 1024 * 1024;
    var rand = new Random();
    var data = new byte[1][];
    data[0] = new byte[oneMillionBytes];
    rand.NextBytes(data[0]);

    var field = new DataField("Data", DataType.ByteArray, hasNulls: true);
    var schema = new Schema(field);
    var dataColumn = new DataColumn(field, data);

    // Write the single column to Parquet (the compression choice makes no difference to the issue).
    using var fileStream1 = File.Open(testParquetFile, FileMode.OpenOrCreate);
    using var parquetWriter = new ParquetWriter(schema, fileStream1,
        formatOptions: new ParquetOptions { TreatByteArrayAsString = false, TreatBigIntegersAsDates = false })
    {
        CompressionMethod = CompressionMethod.Snappy,
    };
    using var rowGroupWriter = parquetWriter.CreateRowGroup();
    rowGroupWriter.WriteColumn(dataColumn);

    // Baseline for comparison: the same bytes as a base64 string in a plain text file.
    using var fileStream2 = File.Open(testStringFile, FileMode.OpenOrCreate);
    using var streamWriter = new StreamWriter(fileStream2);
    await streamWriter.WriteLineAsync(Convert.ToBase64String(data[0]));
}
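To rule out the values themselves being inflated, the file can be read back and the stored payload length compared against the file size on disk. A minimal sketch, assuming the same v3.x reader API (ParquetReader, OpenRowGroupReader, ReadColumn) and the same using directives as above; the method name is illustrative:

```csharp
public static void VerifyRoundTrip()
{
    var testParquetFile = @".\test\test-000.parquet";

    // On-disk size of the file written by the repro above.
    Console.WriteLine($"Parquet file size: {new FileInfo(testParquetFile).Length:N0} bytes");

    using var fileStream = File.OpenRead(testParquetFile);
    using var reader = new ParquetReader(fileStream);
    using var rowGroupReader = reader.OpenRowGroupReader(0);

    // Read the single column back and confirm the stored value is still 1,048,576 bytes.
    var field = reader.Schema.GetDataFields()[0];
    DataColumn column = rowGroupReader.ReadColumn(field);
    var payload = (byte[])column.Data.GetValue(0);
    Console.WriteLine($"Stored value length: {payload.Length:N0} bytes");
}
```

If the stored value length comes back as 1,048,576 while the file is ~9 MB, the inflation is in the on-disk encoding rather than in the data handed to the writer.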
Comments (9 by maintainers)
Dictionary encoding is fully supported now. Feel free to reopen if there are any issues.
Btw, thank you for the detailed investigation, this is awesome.
The other thing: Parquet.Net does not detect repeated strings at the moment (dictionary encoding), but this will come at some point too. Calculating distinct values is extremely CPU- and memory-expensive on large arrays.
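For context on the dictionary-encoding remarks above: the idea is to store each distinct value once and replace the column values with small integer indices into that dictionary, which only pays off when values repeat. A conceptual sketch of the transform (not the Parquet.Net implementation, just an illustration of why distinct-value detection is the costly part):

```csharp
using System;
using System.Collections.Generic;

// Conceptual illustration of dictionary encoding:
// store each distinct value once, replace column values with integer indices.
string[] column = { "red", "blue", "red", "red", "green", "blue" };

var dictionary = new List<string>();
var lookup = new Dictionary<string, int>();
var indices = new int[column.Length];

for (int i = 0; i < column.Length; i++)
{
    if (!lookup.TryGetValue(column[i], out int id))
    {
        id = dictionary.Count;       // first time this value is seen
        dictionary.Add(column[i]);
        lookup.Add(column[i], id);
    }
    indices[i] = id;
}

// dictionary: ["red", "blue", "green"]   indices: [0, 1, 0, 0, 2, 1]
// Building 'lookup' is the distinct-value computation that gets expensive on large columns.
Console.WriteLine(string.Join(",", indices));
```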