[BUG]: ParquetRowGroupReader.ReadColumnAsync returning wrong values for Int32 columns

Library Version

4.16.0

OS

Windows

OS Architecture

64 bit

How to reproduce?

TestParquet.csproj

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Parquet.Net" Version="4.15.0" />
  </ItemGroup>
</Project>

Program.cs

using Parquet;
using Parquet.Data;
using Parquet.Schema;
using Parquet.Serialization;
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

namespace TestParquet
{
    internal class Program
    {
        public class TestClass
        {
            public int Value { get; set; }

            private static Random r { get; } = new Random(0);

            public TestClass()
            {
                this.Value = TestClass.r.Next(int.MinValue, int.MaxValue);
            }
        }

        static async Task Main(string[] args)
        {
            // create items
            int itemCount = 4;
            List<TestClass> items = Enumerable.Range(0, itemCount).Select(i => new TestClass()).ToList();

            List<int> actualValues = new List<int>();
            using (MemoryStream ms = new MemoryStream())
            {
                // create parquet stream from items
                ParquetSerializerOptions options = new ParquetSerializerOptions()
                {
                    Append = false,
                    CompressionLevel = CompressionLevel.SmallestSize,
                    CompressionMethod = CompressionMethod.Gzip,
                };
                ParquetSchema schema = await ParquetSerializer.SerializeAsync(items, ms, options);
                ms.Position = 0;

                // read values in parquet stream
                DataField field = schema.DataFields[0];
                ParquetReader reader = await ParquetReader.CreateAsync(ms, leaveStreamOpen: true);
                for (int rowGroupIndex = 0; rowGroupIndex < reader.RowGroupCount; ++rowGroupIndex)
                {
                    using (ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(rowGroupIndex))
                    {
                        // if itemCount > 4096 then this throws InvalidOperationException: 'don't know how to skip'
                        DataColumn dc = await rowGroupReader.ReadColumnAsync(field);

                        actualValues.AddRange(dc.Data.Cast<int>());
                    }
                }
            }

            // check for differences between expected and actual values
            for (int i = 0; i < items.Count; ++i)
            {
                int expectedValue = items[i].Value;
                int actualValue = actualValues[i];
                if (expectedValue != actualValue)
                    Console.WriteLine($"i {i} expected {expectedValue}, actual {actualValue}");
            }
        }
    }
}

Failing test

When running the TestParquet console app using Parquet.Net 4.15.0, nothing is printed, which indicates that all values were read correctly from the stream.

Change the Parquet.Net package to 4.16.0 and run the app again. It now prints i 3 expected -1945678310, actual -737718758, indicating that the 4th value returned in the data column is incorrect.

Also in 4.16.0, if you change itemCount to any value > 4096, the call DataColumn dc = await rowGroupReader.ReadColumnAsync(field); throws InvalidOperationException: 'don't know how to skip'.
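Both symptoms can be checked in one run by looping the same round trip over a few item counts around the 4096 boundary. The following is a minimal sketch, not part of the original report: the Row class, the BoundaryCheck class, and the CountMismatchesAsync helper are hypothetical names introduced for illustration, and the file is assumed to replace Program.cs in the TestParquet project above. It uses only the Parquet.Net calls already shown in the repro.

```csharp
using Parquet;
using Parquet.Data;
using Parquet.Schema;
using Parquet.Serialization;
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

namespace TestParquet
{
    // Hypothetical row type for this sketch; like TestClass, it has a single Int32 property.
    public class Row
    {
        public int Value { get; set; }
    }

    internal static class BoundaryCheck
    {
        // Writes itemCount rows to an in-memory parquet stream, reads the column back,
        // and returns how many values differ from what was written.
        static async Task<int> CountMismatchesAsync(int itemCount)
        {
            Random random = new Random(0);
            List<Row> items = Enumerable.Range(0, itemCount)
                .Select(i => new Row { Value = random.Next(int.MinValue, int.MaxValue) })
                .ToList();

            List<int> actualValues = new List<int>();
            using (MemoryStream ms = new MemoryStream())
            {
                // Same serializer options as the original repro.
                ParquetSerializerOptions options = new ParquetSerializerOptions()
                {
                    Append = false,
                    CompressionLevel = CompressionLevel.SmallestSize,
                    CompressionMethod = CompressionMethod.Gzip,
                };
                ParquetSchema schema = await ParquetSerializer.SerializeAsync(items, ms, options);
                ms.Position = 0;

                DataField field = schema.DataFields[0];
                ParquetReader reader = await ParquetReader.CreateAsync(ms, leaveStreamOpen: true);
                for (int rowGroupIndex = 0; rowGroupIndex < reader.RowGroupCount; ++rowGroupIndex)
                {
                    using (ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(rowGroupIndex))
                    {
                        DataColumn dc = await rowGroupReader.ReadColumnAsync(field);
                        actualValues.AddRange(dc.Data.Cast<int>());
                    }
                }
            }

            // Count a length difference as mismatches too, then compare element by element.
            int mismatches = Math.Abs(items.Count - actualValues.Count);
            int shared = Math.Min(items.Count, actualValues.Count);
            for (int i = 0; i < shared; ++i)
            {
                if (items[i].Value != actualValues[i])
                    ++mismatches;
            }
            return mismatches;
        }

        static async Task Main()
        {
            // 4 is the count from the report; 4096 and 4097 straddle the reported threshold.
            foreach (int itemCount in new[] { 4, 4096, 4097 })
            {
                try
                {
                    int mismatches = await CountMismatchesAsync(itemCount);
                    Console.WriteLine($"itemCount {itemCount}: {mismatches} mismatched value(s)");
                }
                catch (InvalidOperationException ex)
                {
                    Console.WriteLine($"itemCount {itemCount}: InvalidOperationException: {ex.Message}");
                }
            }
        }
    }
}
```

Per the report, a run against 4.15.0 should show no mismatches for itemCount 4, while 4.16.0 should show a mismatch there and an exception above 4096. The sketch makes no claim about the exact mismatch counts at the larger sizes.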

Issue Analytics

  • State: closed
  • Created a month ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

El-Gor-do commented, Aug 21, 2023

Using Parquet.Net 4.16.1, the test app now correctly shows no output when itemCount <= 4096 but still throws InvalidOperationException: 'don't know how to skip' when itemCount > 4096.

El-Gor-do commented, Aug 22, 2023

I have verified that v4.16.2 fixes InvalidOperationException: 'don't know how to skip'.
