
ParquetWriter - Byte[] Impacting File Size Outputs

Version: v3.9.1 -> v3.7.7 -> v3.5.3 -> v3.3.11 -> v3.1.4
Runtime Version: .NET 4.6.2, .NET 4.7.2, .NET 4.8, and .NET 6
OS: Windows 10

Expected behavior

A one-million-byte array written to Parquet should produce a file of roughly 1 MB plus Parquet's overhead for the file structure and schema metadata, and perhaps smaller with column-based compression.

Actual behavior

A one-million-byte array (as in the example below) written to Parquet is around 9 MB. The blowup is linear across all sizes: 1 GB becomes ~9 GB, and so on. The same issue occurs with very large strings.

Steps to reproduce the behavior

  1. Create an array with a large amount of data.
  2. Create an appropriate DataColumn, Schema, Field, and array of byte[] (byte[][]).
  3. Create a ParquetWriter with TreatByteArrayAsString = false and CompressionMethod Snappy or None (the choice doesn't change the result).
  4. Create a RowGroup.
  5. Write the one DataColumn.
  6. Observe that the output file size is vastly larger than expected.

Additional Details

This example uses random bytes and thus will not compress well.

For comparison, I have included a raw Base64 conversion of the same data, which shows only a ~33% increase over 1 MB; that overhead is expected, since Base64 encodes every 3 bytes of input as 4 output characters.
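
The expected Base64 size can be computed directly; a quick sketch (the variable names are mine, and the 4-characters-per-3-bytes ratio is standard Base64):

var payloadBytes = 1 * 1024 * 1024;             // 1 MiB of raw data
var base64Chars = 4 * ((payloadBytes + 2) / 3); // Base64 emits 4 chars per 3-byte block
Console.WriteLine(base64Chars);                 // 1398104 -> ~1.33 MB, matching the ~33% increase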

As with my last issue (if this is a real issue and not just my Parquet ignorance), I'm happy to help out, but I'd appreciate a pointer in the right direction. My gut feeling is that a buffer length is being written, not the actual content length in the buffer.

Code

using System;
using System.IO;
using System.Threading.Tasks;
using Parquet;
using Parquet.Data;

public static async Task ParquetNet_FileSizeIssue_Reproducible_Async()
{
    var testStringFile = @".\test\test.txt";
    var testParquetFile = @".\test\test-000.parquet";

    if (!Directory.Exists("test"))
    { Directory.CreateDirectory("test"); }

    // Delete leftovers from a previous run so stale bytes don't skew the size comparison.
    if (File.Exists(testParquetFile))
    { File.Delete(testParquetFile); }

    if (File.Exists(testStringFile))
    { File.Delete(testStringFile); }

    var oneMillionBytes = 1 * 1024 * 1024; // 1 MiB (1,048,576 bytes)
    var rand = new Random();
    var data = new byte[1][];             // one row holding a single large byte[] value
    data[0] = new byte[oneMillionBytes];
    rand.NextBytes(data[0]);              // random bytes: effectively incompressible

    var field = new DataField("Data", DataType.ByteArray, hasNulls: true);
    var schema = new Schema(field);
    var dataColumn = new DataColumn(field, data);

    using var fileStream1 = File.Open(testParquetFile, FileMode.OpenOrCreate);
    using var parquetWriter = new ParquetWriter(schema, fileStream1, formatOptions: new ParquetOptions { TreatByteArrayAsString = false, TreatBigIntegersAsDates = false })
    {
        CompressionMethod = CompressionMethod.Snappy,
    };

    using var rowGroupWriter = parquetWriter.CreateRowGroup();
    rowGroupWriter.WriteColumn(dataColumn);

    using var fileStream2 = File.Open(testStringFile, FileMode.OpenOrCreate);
    using var streamWriter = new StreamWriter(fileStream2);

    await streamWriter.WriteLineAsync(Convert.ToBase64String(data[0]));
}
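
A quick size comparison after running the reproduction makes the blowup visible. This is a hypothetical follow-up snippet, not part of the original report; the ~9 MB and ~1.33 MB figures are the observations above:

var parquetMb = new FileInfo(@".\test\test-000.parquet").Length / 1024.0 / 1024.0;
var base64Mb = new FileInfo(@".\test\test.txt").Length / 1024.0 / 1024.0;
Console.WriteLine($"parquet: {parquetMb:F2} MB"); // reported: ~9 MB for 1 MiB of data
Console.WriteLine($"base64:  {base64Mb:F2} MB");  // reported: ~1.33 MB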

(Screenshot omitted; reproduction attached as test-000.zip.)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
aloneguid commented, Jan 19, 2023

Dictionary encoding is fully supported now. Feel free to reopen if there are any issues.

1 reaction
aloneguid commented, Jan 11, 2023

Btw, thank you for the detailed investigation, this is awesome.

The other thing - parquet.net does not detect repeated values at the moment (dictionary encoding), but this will come at some point too. Calculating distinct values is extremely CPU- and memory-expensive on large arrays.
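
For context, dictionary encoding stores each distinct value once and replaces repetitions with small integer indices, so the distinct-value scan is where the CPU and memory cost lives. A minimal conceptual sketch of the idea (illustrative only, not parquet.net's implementation):

using System.Collections.Generic;

// Conceptual dictionary encoding: the dictionary holds each distinct value once,
// and the column data becomes a list of indices into it.
static (List<string> Dictionary, List<int> Indices) DictionaryEncode(IEnumerable<string> values)
{
    var dictionary = new List<string>();
    var lookup = new Dictionary<string, int>(); // distinct-value tracking: this is the expensive part
    var indices = new List<int>();

    foreach (var value in values)
    {
        if (!lookup.TryGetValue(value, out var index))
        {
            index = dictionary.Count; // first occurrence: assign the next index
            lookup[value] = index;
            dictionary.Add(value);
        }
        indices.Add(index);
    }

    return (dictionary, indices);
}

On a column with few distinct values this shrinks the data dramatically; on a column of large, mostly unique byte arrays the lookup table itself becomes the dominant cost, which is the trade-off described above.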
