
Byte array column formatted in V4.2.3 or later causes read error in ParquetViewer

See original GitHub issue

Library Version: … 4.2.3 and later. Works fine with earlier versions.

.NET Version: … .NET 7

OS: … Windows 11

Expected Behaviour

A Parquet file containing one or more byte array columns should be readable by utilities such as ParquetViewer …

Actual Behaviour

ParquetViewer 2.4.2.0 shows an error dialog with message “cannot find data type handler to create model schema for [n:mask, t:FIXED_LEN_BYTE_ARRAY, ct: <not set>, rt: OPTIONAL, c:0]”

I have included the steps I use to format the file, in case I am using the library incorrectly. It is of course possible that there is a bug in ParquetViewer's parsing, but I have no way of determining which component is at fault; I only know there has been a regression in compatibility.

This seems like an error that lots of people would notice. Let me know if you think ParquetViewer is at fault, and I can try and find some other way of checking my files. It looks like ParquetViewer uses Parquet.Net though. Thanks! …
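One way to narrow down which component is at fault is to read the file back with Parquet.Net itself: if the library can read its own output but ParquetViewer cannot, the viewer (or the encoding choice in newer Parquet.Net versions) is the likelier culprit. A minimal sketch, assuming the Parquet.Net 4.x reader API; the file path is illustrative:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

class ReadBackCheck
{
    static async Task Main()
    {
        // "repro.parquet" is an illustrative path; point it at the file
        // that ParquetViewer rejects.
        using Stream stream = File.OpenRead("repro.parquet");
        using ParquetReader reader = await ParquetReader.CreateAsync(stream);

        // Print the logical schema Parquet.Net reconstructs from the file.
        foreach (DataField field in reader.Schema.DataFields)
            Console.WriteLine($"{field.Name}: clr={field.ClrType}, nullable={field.IsNullable}");

        // Read the first row group's first column to confirm the data itself is readable.
        using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(0);
        DataColumn column = await rowGroup.ReadColumnAsync(reader.Schema.DataFields[0]);
        Console.WriteLine($"read {column.Data.Length} values");
    }
}
```

If this succeeds while ParquetViewer still errors, the regression report can include the printed schema to show what the library believes it wrote.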

Steps to Reproduce

Create a file that contains a column created like this: var myData = new List<byte[]>();

then for each row: byte[] myBytes = new byte[someLength]; myData.Add(myBytes); …

Then to format the file:

Code Snippet

var myField = new DataField<byte[]>("colname");
var myColumn = new DataColumn(myField, myData.ToArray());
var mySchema = new ParquetSchema(myField);

using (ParquetWriter myParquetWriter = await ParquetWriter.CreateAsync(mySchema, inRoiParquetStream, append: false).ConfigureAwait(false))
{
    using (ParquetRowGroupWriter myGroupWriter = myParquetWriter.CreateRowGroup())
    {
        await myGroupWriter.WriteColumnAsync(myColumn).ConfigureAwait(false);
    }
}
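Pulled together, the steps above can be sketched as one self-contained repro program. The field name, row count, payload lengths, and output path are all illustrative; note that the schema is built from the field rather than the column, since a DataColumn is not a Field and cannot be passed to the ParquetSchema constructor:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

class Repro
{
    static async Task Main()
    {
        // Build a few rows of raw byte-array data; lengths are arbitrary.
        var myData = new List<byte[]>();
        for (int row = 0; row < 3; row++)
            myData.Add(new byte[16]); // all-zero payload; the content is irrelevant to the repro

        var myField = new DataField<byte[]>("colname");
        var myColumn = new DataColumn(myField, myData.ToArray());
        var mySchema = new ParquetSchema(myField);

        using Stream stream = File.Create("repro.parquet"); // illustrative output path
        using (ParquetWriter writer = await ParquetWriter.CreateAsync(mySchema, stream, append: false))
        using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup())
        {
            await groupWriter.WriteColumnAsync(myColumn);
        }
    }
}
```

Opening the resulting file in ParquetViewer should then reproduce the error dialog described above.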

Issue Analytics

  • State: closed
  • Created 8 months ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
aloneguid commented, Jan 26, 2023

Yeah, that’s due to how the browser handles file streams; once I migrate to the native file access API, it should be comparable to desktop speed.

1 reaction
Dave-Kiwi commented, Jan 25, 2023

Thanks! I appreciate your prompt response. I took a look at https://parquetdbg.aloneguid.uk/ - works great for most of my parquet formats (except for the ‘todo: array’ bit), but seems to hang on one larger (12MB) file (with a column with large byte[] type). But I’m sure you know that. Anyway, thanks for all the effort, and your online tool will be great.


Top Results From Across the Web

scala - Parquet Datatype Issue
I've exercised the above in scala using spark, i.e I was able to read parquet file and store it as impala table and...
Apache Spark job fails with Parquet column cannot be ...
Problem You are reading data in Parquet format and writing to a Delta table when you get a Parquet column cannot be converted...
io.trino.spi.TrinoException: Failed reading parquet data ...
After upgrading to 361, I'm facing an issue when running a fairly straight forward query: SELECT * FROM some_table WHERE some_column ...
When Parquet Columns Get Too Big
Apache Parquet is a columnar file format. Common files we are used to, such as text files, CSV etc. store information one row...
Encodings
Floating point types are encoded in IEEE. For the byte array type, it encodes the length as a 4 byte little endian, followed...