ParquetWriter - Byte[] Impacting File Size Outputs
Version: v3.9.1 -> v3.7.7 -> v3.5.3 -> v3.3.11 -> v3.1.4
Runtime Version: .NET 4.6.2, .NET 4.7.2, .NET 4.8, and .NET 6
OS: Windows 10
Expected behavior
A one-million-byte (1 MiB) array written to Parquet should produce a file of roughly 1 MB plus Parquet's overhead for the file structure and schema metadata, and perhaps smaller with column-based compression.
Actual behavior
A one-million-byte array (as in the example below) written to Parquet produces a file of around 9 MB. The inflation is linear across all sizes: 1 GB becomes ~9 GB, and so on. The same issue occurs with very large strings.
Steps to reproduce the behavior
- Create an array with a large amount of data.
- Create an appropriate DataField, Schema, and DataColumn over an array of byte[] (i.e. a byte[][]).
- Create a ParquetWriter with TreatByteArrayAsString = false and CompressionMethod set to Snappy or None (it makes no difference).
- Create a RowGroup.
- Write the one DataColumn.
- Observe the output file size is vastly larger than expected.
Additional Details
This example is using random bytes and thus will not compress very well.
I have also included a raw base64 conversion to a text file for comparison; it shows around a 33% increase over 1 MB, but that is expected from base64 encoding (every 3 bytes become 4 characters, so 1,048,576 bytes come out at roughly 1.4 MB of text).
As with my last issue, if this is a real bug and not just my Parquet ignorance, I'm happy to help out, but I would appreciate a pointer in the right direction. My gut feeling is that a buffer's allocated length is being written rather than the actual content length in the buffer.
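For a rough sense of what the file "should" cost: Parquet's PLAIN encoding stores a BYTE_ARRAY value as a 4-byte length prefix followed by the raw bytes, so a single 1 MiB value should only add a handful of bytes of per-value overhead on top of page and footer metadata. A back-of-the-envelope sketch (the constants here, including the metadata allowance, are illustrative assumptions, not anything measured from the library):

```csharp
using System;

// Rough expected size of one PLAIN-encoded BYTE_ARRAY value:
// 4-byte length prefix + the payload itself, plus file/page metadata.
const int payloadBytes = 1 * 1024 * 1024;   // the 1 MiB value from the repro
const int lengthPrefixBytes = 4;            // per-value prefix in PLAIN encoding
const int metadataAllowance = 16 * 1024;    // generous allowance for headers/footer (assumption)

long expectedUpperBound = payloadBytes + lengthPrefixBytes + metadataAllowance;
Console.WriteLine($"Expected around {expectedUpperBound:N0} bytes, observed around 9,000,000 bytes");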
Code
// Requires: using System; using System.IO; using System.Threading.Tasks;
//           using Parquet; using Parquet.Data;
public static async Task ParquetNet_FileSizeIssue_Reproducible_Async()
{
    var testStringFile = @".\test\test.txt";
    var testParquetFile = @".\test\test-000.parquet";

    // Start from a clean output directory.
    if (!Directory.Exists("test"))
    { Directory.CreateDirectory("test"); }
    if (File.Exists(testParquetFile))
    { File.Delete(testParquetFile); }
    if (File.Exists(testStringFile))
    { File.Delete(testStringFile); }

    // One row containing a single 1 MiB value of random (incompressible) bytes.
    var oneMillionBytes = 1 * 1024 * 1024;
    var rand = new Random();
    var data = new byte[1][];
    data[0] = new byte[oneMillionBytes];
    rand.NextBytes(data[0]);

    var field = new DataField("Data", DataType.ByteArray, hasNulls: true);
    var schema = new Schema(field);
    var dataColumn = new DataColumn(field, data);

    // Write the single column to Parquet (the compression choice makes no difference to the issue).
    using var fileStream1 = File.Open(testParquetFile, FileMode.OpenOrCreate);
    using var parquetWriter = new ParquetWriter(schema, fileStream1,
        formatOptions: new ParquetOptions { TreatByteArrayAsString = false, TreatBigIntegersAsDates = false })
    {
        CompressionMethod = CompressionMethod.Snappy,
    };
    using var rowGroupWriter = parquetWriter.CreateRowGroup();
    rowGroupWriter.WriteColumn(dataColumn);

    // Baseline for comparison: the same bytes as a base64 string in a plain text file.
    using var fileStream2 = File.Open(testStringFile, FileMode.OpenOrCreate);
    using var streamWriter = new StreamWriter(fileStream2);
    await streamWriter.WriteLineAsync(Convert.ToBase64String(data[0]));
}
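To rule out the values themselves being inflated, the file can be read back and the stored payload length compared against the file size on disk. A minimal sketch, assuming the same v3.x reader API (ParquetReader, OpenRowGroupReader, ReadColumn) and the same using directives as above; the method name is illustrative:

```csharp
public static void VerifyRoundTrip()
{
    var testParquetFile = @".\test\test-000.parquet";

    // On-disk size of the file written by the repro above.
    Console.WriteLine($"Parquet file size: {new FileInfo(testParquetFile).Length:N0} bytes");

    using var fileStream = File.OpenRead(testParquetFile);
    using var reader = new ParquetReader(fileStream);
    using var rowGroupReader = reader.OpenRowGroupReader(0);

    // Read the single column back and confirm the stored value is still 1,048,576 bytes.
    var field = reader.Schema.GetDataFields()[0];
    DataColumn column = rowGroupReader.ReadColumn(field);
    var payload = (byte[])column.Data.GetValue(0);
    Console.WriteLine($"Stored value length: {payload.Length:N0} bytes");
}
```

If the stored value length comes back as 1,048,576 while the file is ~9 MB, the inflation is in the on-disk encoding rather than in the data handed to the writer.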
Comments (9 by maintainers)
Dictionary encoding is fully supported now. Feel free to reopen if there are any issues.
Btw, thank you for the detailed investigation, this is awesome.
The other thing: Parquet.Net does not detect repeated strings at the moment (dictionary encoding), but this will come at some point too. Calculating distinct values is extremely CPU- and memory-expensive on large arrays.
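For context on the dictionary-encoding remarks above: the idea is to store each distinct value once and replace the column values with small integer indices into that dictionary, which only pays off when values repeat. A conceptual sketch of the transform (not the Parquet.Net implementation, just an illustration of why distinct-value detection is the costly part):

```csharp
using System;
using System.Collections.Generic;

// Conceptual illustration of dictionary encoding:
// store each distinct value once, replace column values with integer indices.
string[] column = { "red", "blue", "red", "red", "green", "blue" };

var dictionary = new List<string>();
var lookup = new Dictionary<string, int>();
var indices = new int[column.Length];

for (int i = 0; i < column.Length; i++)
{
    if (!lookup.TryGetValue(column[i], out int id))
    {
        id = dictionary.Count;       // first time this value is seen
        dictionary.Add(column[i]);
        lookup.Add(column[i], id);
    }
    indices[i] = id;
}

// dictionary: ["red", "blue", "green"]   indices: [0, 1, 0, 0, 2, 1]
// Building 'lookup' is the distinct-value computation that gets expensive on large columns.
Console.WriteLine(string.Join(",", indices));
```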