
Byte array column formatted in V4.2.3 or later causes read error in ParquetViewer

See original GitHub issue

Library Version: … 4.2.3 and later. Works fine with earlier versions.

.NET Version: … .NET 7

OS: … Windows 11

Expected Behaviour

A Parquet file containing one or more byte array columns should be readable by utilities such as ParquetViewer …

Actual Behaviour

ParquetViewer 2.4.2.0 shows an error dialog with message “cannot find data type handler to create model schema for [n:mask, t:FIXED_LEN_BYTE_ARRAY, ct: <not set>, rt: OPTIONAL, c:0]”

I have included the steps I use to format the file, in case I am using the library incorrectly. It is of course possible that there is a bug in ParquetViewer's parsing, but I have no way of determining which component is at fault; I only know there has been a regression in compatibility.

This seems like an error that lots of people would notice. Let me know if you think ParquetViewer is at fault, and I can try and find some other way of checking my files. It looks like ParquetViewer uses Parquet.Net though. Thanks! …
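One way to narrow down which component is at fault is to read the file back with Parquet.Net itself: if the library can read its own output but ParquetViewer cannot, the viewer (or the encoding choice in newer Parquet.Net versions) is the likelier culprit. A minimal sketch, assuming the Parquet.Net 4.x reader API; the file path is illustrative:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

class ReadBackCheck
{
    static async Task Main()
    {
        // "repro.parquet" is an illustrative path; point it at the file
        // that ParquetViewer rejects.
        using Stream stream = File.OpenRead("repro.parquet");
        using ParquetReader reader = await ParquetReader.CreateAsync(stream);

        // Print the logical schema Parquet.Net reconstructs from the file.
        foreach (DataField field in reader.Schema.DataFields)
            Console.WriteLine($"{field.Name}: clr={field.ClrType}, nullable={field.IsNullable}");

        // Read the first row group's first column to confirm the data itself is readable.
        using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(0);
        DataColumn column = await rowGroup.ReadColumnAsync(reader.Schema.DataFields[0]);
        Console.WriteLine($"read {column.Data.Length} values");
    }
}
```

If this succeeds while ParquetViewer still errors, the regression report can include the printed schema to show what the library believes it wrote.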

Steps to Reproduce

Create a file that contains a column created like this: var myData = new List<byte[]>();

then for each row: byte[] myBytes = new byte[someLength]; myData.Add(myBytes); …

Then to format the file:

Code Snippet

var myField = new DataField<byte[]>("colname");
var myColumn = new DataColumn(myField, myData.ToArray());
var mySchema = new ParquetSchema(myField);

using (ParquetWriter myParquetWriter = await ParquetWriter.CreateAsync(mySchema, inRoiParquetStream, append: false).ConfigureAwait(false))
{
    using (ParquetRowGroupWriter myGroupWriter = myParquetWriter.CreateRowGroup())
    {
        await myGroupWriter.WriteColumnAsync(myColumn).ConfigureAwait(false);
    }
}
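Pulled together, the steps above can be sketched as one self-contained repro program. The field name, row count, payload lengths, and output path are all illustrative; note that the schema is built from the field rather than the column, since a DataColumn is not a Field and cannot be passed to the ParquetSchema constructor:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

class Repro
{
    static async Task Main()
    {
        // Build a few rows of raw byte-array data; lengths are arbitrary.
        var myData = new List<byte[]>();
        for (int row = 0; row < 3; row++)
            myData.Add(new byte[16]); // all-zero payload; the content is irrelevant to the repro

        var myField = new DataField<byte[]>("colname");
        var myColumn = new DataColumn(myField, myData.ToArray());
        var mySchema = new ParquetSchema(myField);

        using Stream stream = File.Create("repro.parquet"); // illustrative output path
        using (ParquetWriter writer = await ParquetWriter.CreateAsync(mySchema, stream, append: false))
        using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup())
        {
            await groupWriter.WriteColumnAsync(myColumn);
        }
    }
}
```

Opening the resulting file in ParquetViewer should then reproduce the error dialog described above.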

Issue Analytics

  • State: closed
  • Created 8 months ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
aloneguid commented, Jan 26, 2023

Yeah, that’s due to how the browser handles file streams; once I migrate to the native file access API, it should be comparable to desktop speed.

1 reaction
Dave-Kiwi commented, Jan 25, 2023

Thanks! I appreciate your prompt response. I took a look at https://parquetdbg.aloneguid.uk/ - works great for most of my parquet formats (except for the ‘todo: array’ bit), but seems to hang on one larger (12MB) file (with a column with large byte[] type). But I’m sure you know that. Anyway, thanks for all the effort, and your online tool will be great.


Top Results From Across the Web

scala - Parquet Datatype Issue
I've exercised the above in scala using spark, i.e I was able to read parquet file and store it as impala table and...
Apache Spark job fails with Parquet column cannot be ...
Problem You are reading data in Parquet format and writing to a Delta table when you get a Parquet column cannot be converted...
io.trino.spi.TrinoException: Failed reading parquet data ...
After upgrading to 361, I'm facing an issue when running a fairly straight forward query: SELECT * FROM some_table WHERE some_column ...
When Parquet Columns Get Too Big
Apache Parquet is a columnar file format. Common files we are used to, such as text files, CSV etc. store information one row...
Encodings
Floating point types are encoded in IEEE. For the byte array type, it encodes the length as a 4 byte little endian, followed...