[BUG]: ParquetRowGroupReader.ReadColumnAsync returning wrong values for Int32 columns
Library Version
4.16.0
OS
Windows
OS Architecture
64 bit
How to reproduce?
TestParquet.csproj
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Parquet.Net" Version="4.15.0" />
  </ItemGroup>
</Project>
Program.cs
using Parquet;
using Parquet.Data;
using Parquet.Schema;
using Parquet.Serialization;
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

namespace TestParquet
{
    internal class Program
    {
        public class TestClass
        {
            public int Value { get; set; }

            // fixed seed so every run generates the same sequence of values
            private static Random r { get; } = new Random(0);

            public TestClass()
            {
                this.Value = TestClass.r.Next(int.MinValue, int.MaxValue);
            }
        }

        static async Task Main(string[] args)
        {
            // create items
            int itemCount = 4;
            List<TestClass> items = Enumerable.Range(0, itemCount).Select(i => new TestClass()).ToList();
            List<int> actualValues = new List<int>();

            using (MemoryStream ms = new MemoryStream())
            {
                // create parquet stream from items
                ParquetSerializerOptions options = new ParquetSerializerOptions()
                {
                    Append = false,
                    CompressionLevel = CompressionLevel.SmallestSize,
                    CompressionMethod = CompressionMethod.Gzip,
                };
                ParquetSchema schema = await ParquetSerializer.SerializeAsync(items, ms, options);
                ms.Position = 0;

                // read values in parquet stream
                DataField field = schema.DataFields[0];
                ParquetReader reader = await ParquetReader.CreateAsync(ms, leaveStreamOpen: true);
                for (int rowGroupIndex = 0; rowGroupIndex < reader.RowGroupCount; ++rowGroupIndex)
                {
                    using (ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(rowGroupIndex))
                    {
                        // if itemCount > 4096 then this throws InvalidOperationException: 'don't know how to skip'
                        DataColumn dc = await rowGroupReader.ReadColumnAsync(field);
                        actualValues.AddRange(dc.Data.Cast<int>());
                    }
                }
            }

            // check for differences between expected and actual values
            for (int i = 0; i < items.Count; ++i)
            {
                int expectedValue = items[i].Value;
                int actualValue = actualValues[i];
                if (expectedValue != actualValue)
                    Console.WriteLine($"i {i} expected {expectedValue}, actual {actualValue}");
            }
        }
    }
}
Failing test
When running the TestParquet console app with Parquet.Net 4.15.0, nothing is printed, which indicates that all values were read correctly from the stream.
After changing the Parquet.Net package to 4.16.0 and running the app again, it prints i 3 expected -1945678310, actual -737718758
indicating that the 4th value in the data column is returned incorrectly.
Also in 4.16.0, if you change itemCount to any value > 4096, then DataColumn dc = await rowGroupReader.ReadColumnAsync(field);
throws InvalidOperationException: 'don't know how to skip'
Issue Analytics
- Created a month ago
- Comments: 5 (2 by maintainers)
Using Parquet.Net 4.16.1, the test app now correctly shows no output when itemCount <= 4096, but still throws InvalidOperationException: 'don't know how to skip' when itemCount > 4096.
I have verified that v4.16.2 fixes InvalidOperationException: 'don't know how to skip'
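For anyone reproducing this, a minimal sketch of the package reference change, assuming the same TestParquet.csproj as in the report. Only the Version value differs from the original file, and 4.16.2 is taken from the comment above rather than independently confirmed here:

```xml
<ItemGroup>
  <!-- 4.16.2 is the version reported above as fixing 'don't know how to skip' -->
  <PackageReference Include="Parquet.Net" Version="4.16.2" />
</ItemGroup>
```

With this reference in place, rerunning the test app with itemCount set both to 4 and to a value above 4096 should exercise both failure modes described in the report.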