EndOfStreamException - can't read empty data page
Version: Parquet.Net latest source (v3.7.7)
Runtime Version: .NET Core 3.0, I presume. I am running the Parquet.Test tests in Visual Studio.
OS: Windows
Expected behavior
ParquetReader.ReadEntireRowGroup should complete successfully on my test file, which was written by Apache Spark 2.4.5. Sorry, I cannot share this file.
Python’s pandas read_parquet reads it successfully.
Actual behavior
Message:
System.IO.EndOfStreamException : Unable to read beyond the end of the stream.
Stack Trace:
BinaryReader.InternalRead(Int32 numBytes)
BinaryReader.ReadInt32()
RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid(BinaryReader reader, Int32 bitWidth, Int32 length, Int32[] dest, Int32 offset, Int32 pageSize) line 37
DataColumnReader.ReadPlainDictionary(BinaryReader reader, Int32 maxReadCount, Int32[] dest, Int32 offset) line 279
DataColumnReader.ReadColumn(BinaryReader reader, Encoding encoding, Int64 totalValues, Int32 maxReadCount, ColumnRawData cd) line 253
DataColumnReader.ReadDataPage(PageHeader ph, ColumnRawData cd, Int64 maxValues) line 216
DataColumnReader.Read() line 88
ParquetRowGroupReader.ReadColumn(DataField field) line 64
ParquetReader.ReadEntireRowGroup(Int32 rowGroupIndex) line 141
ParquetReaderTest.Reads_Exception() line 24
Here is Parquet.Thrift.PageHeader.ToString() for the last page of this column - formatted for readability:
PageHeader(, Type: DATA_PAGE,
Uncompressed_page_size: 8, Compressed_page_size: 28,
Crc: -680454176,
Data_page_header: DataPageHeader(, Num_values: 125, Encoding: PLAIN_DICTIONARY,
Definition_level_encoding: RLE, Repetition_level_encoding: BIT_PACKED,
Statistics: Statistics(Null_count: 125)
)
)
Call graph:
Parquet.File.DataColumnReader
\-> ReadDataPage
    \-> ReadPageData - ungzips the page, produces 8 bytes
    \-> ReadLevels - consumes 7 bytes.
    \-> ReadColumn
        \-> ReadPlainDictionary - consumes the last byte.
            \-> GetRemainingLength - returns 0, as expected.
            \-> RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid <boom>
That method interprets “length == 0” to mean “length is unknown, find it in the stream”.
But here the length really is 0, and trying to read from the empty stream causes the boom.
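To make the failure concrete, here is a minimal Python sketch of the behaviour described above. It is a toy model, not Parquet.Net code: the function names mirror the stack trace, but the bodies, the placeholder page contents, and the omitted run decoding are assumptions made for illustration.

```python
import io
import struct


def read_int32(stream: io.BytesIO) -> int:
    """Read a little-endian int32, failing if fewer than 4 bytes remain."""
    raw = stream.read(4)
    if len(raw) < 4:
        raise EOFError("Unable to read beyond the end of the stream.")
    return struct.unpack("<i", raw)[0]


def read_rle_bitpacked_hybrid(stream: io.BytesIO, bit_width: int, length: int) -> list:
    """Toy model: length == 0 is taken to mean 'unknown, read it from the stream'."""
    if length == 0:
        length = read_int32(stream)  # fails when the page is already exhausted
    payload = stream.read(length)
    return list(payload)  # real run decoding is omitted in this sketch


# An 8-byte page, as in the header above (the contents here are placeholders).
page_bytes = bytes(8)
page = io.BytesIO(page_bytes)
page.read(7)                                # ReadLevels consumes 7 bytes
bit_width = page.read(1)[0]                 # ReadPlainDictionary consumes the last byte
remaining = len(page_bytes) - page.tell()   # nothing is left to decode

try:
    read_rle_bitpacked_hybrid(page, bit_width, remaining)
    outcome = "ok"
except EOFError as exc:
    outcome = f"EOFError: {exc}"
```

The ambiguity is that a genuine "zero bytes remaining" and the "length unknown" sentinel share the same value, so the callee cannot tell them apart.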
If I just have ReadPlainDictionary skip the ReadRleBitpackedHybrid call in this case, then all is well: the file’s remaining columns are read successfully. Is this a sensible solution?
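A minimal Python sketch of that guard, again a toy model rather than the library's code: `read_plain_dictionary` and `exploding_decoder` are hypothetical names, and the page bytes are placeholders. It relies on the header above, where Null_count equals Num_values (125), so the page should hold no dictionary indices at all.

```python
import io
from typing import Callable, List


def read_plain_dictionary(
    page: io.BytesIO,
    page_size: int,
    decode: Callable[[io.BytesIO, int, int], List[int]],
) -> List[int]:
    """Read dictionary indices, skipping the decoder when no bytes remain."""
    bit_width = page.read(1)[0]          # first byte of the values section
    remaining = page_size - page.tell()
    if remaining == 0:
        return []                        # every value is null: nothing to decode
    return decode(page, bit_width, remaining)


def exploding_decoder(stream: io.BytesIO, bit_width: int, length: int) -> List[int]:
    """Stand-in decoder that proves the guard short-circuits the call."""
    raise AssertionError("decoder should not be called for an empty page")


page_bytes = bytes(8)    # placeholder contents for the 8-byte page
page = io.BytesIO(page_bytes)
page.read(7)             # levels already consumed by the caller
indices = read_plain_dictionary(page, len(page_bytes), exploding_decoder)
```

Placing the check in the caller keeps the decoder's "0 means unknown" convention intact for other call sites. Whether that is the right place for it in Parquet.Net is the open question here.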
@peteriehl I actually have a change that fixes it, but it’s only on a private fork.
I did not raise the PR because I am not able to share the file with the problem, which the PR uses for a unit test.
I think I should raise it anyway, even without a test, maybe you can build on that?
Yes, but since it is semi-confidential, I would rather send it to you privately. Please send your email to yewit41475@yutongdt.com and I will send it to you.