EndOfStreamException - can't read empty data page

Version: Parquet.Net latest source (v3.7.7)

Runtime Version: .NET Core 3.0, I presume; I am running the Parquet.Test tests in Visual Studio

OS: Windows

Expected behavior

ParquetReader.ReadEntireRowGroup should complete successfully on my test file, which was written by Apache Spark 2.4.5. Sorry, I cannot share this file.

Python’s pandas read_parquet reads it successfully.

Actual behavior

Message: 
    System.IO.EndOfStreamException : Unable to read beyond the end of the stream.
  Stack Trace: 
    BinaryReader.InternalRead(Int32 numBytes)
    BinaryReader.ReadInt32()
    RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid(BinaryReader reader, Int32 bitWidth, Int32 length, Int32[] dest, Int32 offset, Int32 pageSize) line 37
    DataColumnReader.ReadPlainDictionary(BinaryReader reader, Int32 maxReadCount, Int32[] dest, Int32 offset) line 279
    DataColumnReader.ReadColumn(BinaryReader reader, Encoding encoding, Int64 totalValues, Int32 maxReadCount, ColumnRawData cd) line 253
    DataColumnReader.ReadDataPage(PageHeader ph, ColumnRawData cd, Int64 maxValues) line 216
    DataColumnReader.Read() line 88
    ParquetRowGroupReader.ReadColumn(DataField field) line 64
    ParquetReader.ReadEntireRowGroup(Int32 rowGroupIndex) line 141
    ParquetReaderTest.Reads_Exception() line 24

Here is Parquet.Thrift.PageHeader.ToString() for the last page of this column - formatted for readability:

PageHeader(, Type: DATA_PAGE,
    Uncompressed_page_size: 8, Compressed_page_size: 28,
    Crc: -680454176,
    Data_page_header: DataPageHeader(, Num_values: 125, Encoding: PLAIN_DICTIONARY,
        Definition_level_encoding: RLE, Repetition_level_encoding: BIT_PACKED,
        Statistics: Statistics(Null_count: 125)
    )
)
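Note that Null_count equals Num_values in this header, so every entry in the page is null. In Parquet, a value is physically stored only when its definition level equals the column's maximum definition level. The following is a minimal sketch of that count, assuming a flat optional column (maximum definition level 1); the class and method names are hypothetical and not part of Parquet.Net.

```csharp
public static class PageValueCount
{
    // Counts the entries that carry an actual value in the page body.
    // An entry is non-null only when its definition level reaches the maximum.
    public static int CountDefinedValues(int[] definitionLevels, int maxDefinitionLevel)
    {
        int count = 0;
        foreach (int level in definitionLevels)
        {
            if (level == maxDefinitionLevel)
            {
                count++;
            }
        }
        return count;
    }
}
```

For a page like this one, 125 definition levels of 0 give a count of 0, so there are no dictionary indexes left to decode once the levels have been read.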

Call graph:

Parquet.File.DataColumnReader
\-> ReadDataPage
    \-> ReadPageData                - ungzips the page, produces 8 bytes
    \-> ReadLevels                  - consumes 7 bytes.
    \-> ReadColumn
        \-> ReadPlainDictionary     - consumes the last byte.
            \-> GetRemainingLength  - returns 0, as expected.
            \-> RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid    <boom>

That method interprets “length == 0” to mean “length is unknown, find it in the stream”:

https://github.com/aloneguid/parquet-dotnet/blob/60e454520eae7f7945bea471b0e9cb888c09cae9/src/Parquet/File/Values/RunLengthBitPackingHybridValuesReader.cs#L35-L37

But the length really is 0 and trying to read from the empty stream causes the boom.
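The failure can be reproduced in isolation. The sketch below is not the Parquet.Net code; `ReadLengthPrefix` is a hypothetical stand-in for the branch that treats a length of 0 as "read the length from the stream", based on the `BinaryReader.ReadInt32()` frame in the stack trace above.

```csharp
using System;
using System.IO;

public static class EmptyPageRepro
{
    // Hypothetical stand-in: a length of 0 is taken to mean
    // "unknown, so read a 4-byte length prefix from the stream".
    public static int ReadLengthPrefix(BinaryReader reader, int length)
    {
        return length == 0 ? reader.ReadInt32() : length;
    }

    // Returns true when reading from an exhausted stream throws EndOfStreamException.
    public static bool ThrowsOnEmptyStream()
    {
        using var stream = new MemoryStream(Array.Empty<byte>());
        using var reader = new BinaryReader(stream);
        try
        {
            ReadLengthPrefix(reader, 0);
            return false;
        }
        catch (EndOfStreamException)
        {
            return true;
        }
    }
}
```

The value 0 is overloaded here: it cannot distinguish "length not supplied" from "nothing left to read", which is the ambiguity described above.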

If I just have ReadPlainDictionary skip the ReadRleBitpackedHybrid call in this case, then all is well: the file’s remaining columns are read successfully. Is this a sensible solution?
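A rough sketch of that guard is shown below. It is an illustration under assumptions, not the actual Parquet.Net change: the class, the `GetRemainingLength` helper, and the `decodeIndexes` delegate (standing in for the RLE/bit-packed hybrid decoder) are all hypothetical, and only the order of steps follows the call graph above.

```csharp
using System;
using System.IO;

public static class DictionaryIndexReader
{
    // Hypothetical helper: number of bytes left in the page stream.
    public static int GetRemainingLength(BinaryReader reader)
    {
        return (int)(reader.BaseStream.Length - reader.BaseStream.Position);
    }

    // Reads the bit-width byte, then decodes dictionary indexes only if bytes remain.
    // Returns the number of indexes written to dest.
    public static int ReadPlainDictionary(
        BinaryReader reader,
        int[] dest,
        int offset,
        Func<BinaryReader, int, int, int[], int, int> decodeIndexes)
    {
        if (GetRemainingLength(reader) == 0)
        {
            return 0; // no value bytes at all
        }

        int bitWidth = reader.ReadByte();
        int length = GetRemainingLength(reader);

        // Guard: the page holds only the bit-width byte (for example, all values are null),
        // so there is nothing to decode and the decoder must not be called with length 0.
        if (length == 0)
        {
            return 0;
        }

        return decodeIndexes(reader, bitWidth, length, dest, offset);
    }
}
```

Returning early keeps the "length 0 means unknown" convention inside the decoder untouched, at the cost of leaving that ambiguity in place for other callers.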

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

ishepherd commented, Dec 17, 2020

@peteriehl I actually have a change that fixes it, but it’s only on a private fork.

I did not raise the PR because I am not able to share the file with the problem, which the PR uses for a unit test.

I think I should raise it anyway, even without a test; maybe you can build on that?

tomasfalt commented, Jan 15, 2021

Yes, but since it is semi-confidential I would rather send it to you privately. Please send your email to yewit41475@yutongdt.com and I will send it to you.
