Incorrect comparison of Parquet binary statistics for accented characters
On version 0.216, Presto incorrectly assumes that a binary column statistic is corrupt because of the ordering of accented values. The root cause is probably the naive byte-wise comparison made by the Slice library here: https://github.com/prestodb/presto/blob/master/presto-parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java#L201
I have added a simple test case to TupleDomainParquetPredicateTest that should not fail:
    @Test
    public void testAccentedString()
            throws ParquetCorruptionException
    {
        String column = "StringColumn";
        assertEquals(
                getDomain(createUnboundedVarcharType(), 10, stringColumnStats("Áncash", "china"), ID, column, true),
                create(ValueSet.ofRanges(range(createUnboundedVarcharType(), utf8Slice("Áncash"), true, utf8Slice("china"), true)), false));
    }
but it fails with:
com.facebook.presto.parquet.ParquetCorruptionException: Corrupted statistics for column "StringColumn" in Parquet file "testFile": [min: Áncash, max: china, num_nulls: 0]
"Áncash" comes before "china" in natural ordering, but Presto flags the statistics as corrupt because it does not use natural ordering to compare binary statistics.
As additional information, the files that led me to this error were generated by Spark.
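To make the disagreement concrete, here is a minimal Java sketch comparing the UTF-8 bytes of the two values from the test case in two ways. It rests on two assumptions that the issue itself does not confirm: that the writer ordered bytes as signed values, and that the reader orders them as unsigned values. The class and method names (ByteOrderDemo, compareSigned, compareUnsigned) are hypothetical and are not part of Presto or Parquet.

```java
import java.nio.charset.StandardCharsets;

final class ByteOrderDemo {
    // Lexicographic comparison that treats each byte as signed (-128..127).
    // Assumption: this models the ordering used when the statistics were written.
    static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Byte.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Lexicographic comparison that treats each byte as unsigned (0..255).
    // Assumption: this models the ordering used when the statistics are validated.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        // "\u00C1" is 'Á'; in UTF-8 it is encoded as the bytes 0xC3 0x81.
        byte[] min = "\u00C1ncash".getBytes(StandardCharsets.UTF_8);
        byte[] max = "china".getBytes(StandardCharsets.UTF_8);

        // Signed: 0xC3 is -61, which is below 'c' (99), so min sorts before max.
        System.out.println("signed   min <= max: " + (compareSigned(min, max) <= 0));
        // Unsigned: 0xC3 is 195, which is above 'c' (99), so min sorts after max.
        System.out.println("unsigned min <= max: " + (compareUnsigned(min, max) <= 0));
    }
}
```

Under these assumptions the signed check passes and the unsigned check fails, which matches a min/max pair that looks valid to the writer but corrupt to the reader.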
Top GitHub Comments
ping @findepi – can someone from the Presto team take a look at this?
We hit this problem as well, also with an irregular string as the minimum in the Parquet statistics.
If the reason for this is that the comparison strategies of Parquet and Presto differ, that sounds like pretty bad news to me.
Say we have strings A, B, and C, and the order for Parquet is A, B, C while for Presto it is A, C, B. (I don't mean to imply that Parquet's order is correct and Presto's isn't; this is just an example.)
In the best case, we have a Parquet page of Bs and Cs: the stats will show min=B and max=C, and Presto will raise the corrupt stats exception because it believes that B > C.
In the worst case, we have a Parquet page of As, Bs, and Cs: the stats will show min=A and max=C. Presto will not raise an exception, as it also finds that A < C. But if Presto is looking for Bs, it will skip the page completely because it believes that B > C. No exception will be raised and Presto will return the wrong result.
Am I missing something?
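The worst case described above can be sketched in Java as follows. This is an illustration under stated assumptions, not Presto's or Parquet's actual code: the writer is assumed to order bytes as signed values and the reader as unsigned values, and every name here (StatsSkipDemo, mayContain, the sample strings) is hypothetical. The three strings are chosen so that the writer's order is A, B, C and the reader's order is A, C, B.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class StatsSkipDemo {
    // Assumed writer ordering: bytes compared as signed values.
    static final Comparator<byte[]> SIGNED = (a, b) -> compare(a, b, false);
    // Assumed reader ordering: bytes compared as unsigned values.
    static final Comparator<byte[]> UNSIGNED = (a, b) -> compare(a, b, true);

    static int compare(byte[] a, byte[] b, boolean unsigned) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = unsigned ? (a[i] & 0xFF) : a[i];
            int y = unsigned ? (b[i] & 0xFF) : b[i];
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // A page can be skipped when the wanted value lies outside [min, max].
    static boolean mayContain(byte[] min, byte[] max, byte[] value, Comparator<byte[]> order) {
        return order.compare(min, value) <= 0 && order.compare(value, max) <= 0;
    }

    public static void main(String[] args) {
        // A = "ancash", B = "chávez" ('á' is 0xC3 0xA1 in UTF-8), C = "china".
        List<byte[]> page = Arrays.asList(utf8("ancash"), utf8("ch\u00E1vez"), utf8("china"));

        // The writer computes the statistics with its own ordering: min=A, max=C.
        byte[] min = Collections.min(page, SIGNED);
        byte[] max = Collections.max(page, SIGNED);

        // The reader's sanity check passes, so no corruption is reported.
        System.out.println("reader sees min <= max: " + (UNSIGNED.compare(min, max) <= 0));

        // The reader looks for B and decides the page cannot contain it.
        byte[] wanted = utf8("ch\u00E1vez");
        System.out.println("reader keeps page: " + mayContain(min, max, wanted, UNSIGNED));
        System.out.println("writer's order keeps page: " + mayContain(min, max, wanted, SIGNED));
    }
}
```

With these inputs the reader accepts the statistics yet concludes that the page cannot contain B, even though B is in the page. That is the silent wrong-result scenario, as opposed to the exception raised in the best case.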