Spark Iceberg manifest reports wrong parquet file sizes.
We are using Spark with Iceberg, and some Iceberg manifest files report the wrong data file (Parquet) size: it is about 2x larger than the actual Parquet file size. The issue was found while investigating Presto Iceberg iss6369.
The problem might be in the ParquetWriter#length() method:
return writer.getPos() + (writeStore.isColumnFlushNeeded() ? writeStore.getBufferedSize() : 0);
Maybe that is why the Parquet file size recorded in the manifest is larger than the actual file size on disk.
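To make the suspicion concrete, here is a toy model of that expression. It is a sketch only: the class and method names are hypothetical, nothing here is Iceberg or Parquet code, and DEFLATE simply stands in for whatever encoding or compression happens when buffered data is flushed. It shows that adding an in-memory buffered size to the stream position can overshoot the final on-disk size if the buffered bytes shrink when written. Whether this is what actually happens in ParquetWriter is not confirmed here.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.Deflater;

// Toy model only: none of these names come from Iceberg or Parquet.
final class LengthEstimateSketch {

    // Same shape as the quoted expression: bytes already written plus,
    // when a flush is pending, the size of the data still buffered in memory.
    static long estimatedLength(long writtenPos, boolean flushNeeded, long bufferedSize) {
        return writtenPos + (flushNeeded ? bufferedSize : 0);
    }

    // Compresses the buffer with DEFLATE and returns the compressed byte count.
    // This is an assumed stand-in for the size of the data once it is flushed.
    static long flushedSize(byte[] buffered) {
        Deflater deflater = new Deflater();
        deflater.setInput(buffered);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        byte[] buffered = new byte[64 * 1024];
        Arrays.fill(buffered, (byte) 7); // highly repetitive, so it compresses well
        long writtenPos = 1_000;
        long estimate = estimatedLength(writtenPos, true, buffered.length);
        long actual = writtenPos + flushedSize(buffered);
        System.out.println("estimate=" + estimate + " actual=" + actual);
    }
}
```

If the length is sampled while a flush is still pending, the recorded value reflects the estimate rather than the bytes that finally land in the file.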
Quick update: I’ve reproduced the issue. I will update the ticket as soon as I have more details.
Thanks for the context, @dmgcodevil. That is definitely a problem. I think we will want to have a Trino fix for it, with the ability to fix metadata as a workaround until that is released. If you have a utility to share that fixes the metadata, I think that would be useful for other people. Thanks!
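For anyone who wants to check whether their tables are affected before a fix lands, the detection half of such a utility could look roughly like the sketch below. It is not the utility discussed above and does not touch Iceberg metadata. It assumes the caller has already extracted a map of data file path to recorded size from the manifests, and that the files are on a local filesystem (for HDFS or object storage the size lookup would need a different client). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: compares sizes recorded in metadata with sizes on disk.
final class ManifestSizeCheck {

    // recorded: data file path -> file size taken from the manifest (supplied by the caller).
    // Returns, for each mismatching file, a two-element array {recordedSize, actualSize}.
    static Map<Path, long[]> findMismatches(Map<Path, Long> recorded) throws IOException {
        Map<Path, long[]> mismatches = new LinkedHashMap<>();
        for (Map.Entry<Path, Long> entry : recorded.entrySet()) {
            long actual = Files.size(entry.getKey()); // size in bytes on the local filesystem
            if (actual != entry.getValue()) {
                mismatches.put(entry.getKey(), new long[] {entry.getValue(), actual});
            }
        }
        return mismatches;
    }
}
```

The output only lists suspect files. Rewriting the manifests with corrected sizes is a separate step and is deliberately left out.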