Spark Iceberg manifest reports wrong parquet file sizes.

We are using Spark with Iceberg, and some Iceberg manifest files report the wrong data file (Parquet) size: it is roughly 2x larger than the actual Parquet file size. The issue was found while investigating Presto Iceberg iss6369.

The problem might be in the ParquetWriter#length() method:

return writer.getPos() + (writeStore.isColumnFlushNeeded() ? writeStore.getBufferedSize() : 0);

Maybe that's why the Parquet file size recorded in the manifest is larger than the actual file size on disk.
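
To confirm which files are affected, the recorded sizes can be compared against the sizes on disk. The following is a minimal sketch, not part of Iceberg: it assumes the `file_path` and `file_size_in_bytes` values have already been exported from the table's `files` metadata table into a map, and that the data files are readable through the local filesystem. The class name `ManifestSizeCheck` and the method `findMismatches` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class ManifestSizeCheck {

    /**
     * Compares recorded sizes with on-disk sizes.
     *
     * @param recordedSizes file path -> size recorded in the manifest
     *                      (assumed to be exported beforehand)
     * @return file path -> (recorded size - actual size), only for files that differ
     */
    static Map<String, Long> findMismatches(Map<String, Long> recordedSizes) throws IOException {
        Map<String, Long> mismatches = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : recordedSizes.entrySet()) {
            // Actual size in bytes of the file on the local filesystem.
            long actual = Files.size(Paths.get(entry.getKey()));
            long diff = entry.getValue() - actual;
            if (diff != 0) {
                mismatches.put(entry.getKey(), diff);
            }
        }
        return mismatches;
    }
}
```

A positive difference means the manifest overstates the file size, which is the symptom described above. For files on HDFS or object storage, the `Files.size` call would have to be replaced with the corresponding storage client.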

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
dmgcodevil commented, Dec 23, 2020

Quick update: I've reproduced the issue. I will update the ticket as soon as I have more details.

0 reactions
rdblue commented, Jan 6, 2021

Thanks for the context, @dmgcodevil. That is definitely a problem. I think we will want to have a Trino fix for it, with the ability to fix metadata as a workaround until that is released. If you have a utility to share that fixes the metadata, I think that would be useful for other people. Thanks!
