Incorrect comparison of Parquet binary statistics for accented characters

On version 0.216, Presto incorrectly treats a binary column statistic as corrupt because of how it orders accented values. The root cause is probably the naive comparison made by the slice library here: https://github.com/prestodb/presto/blob/master/presto-parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java#L201

I added a simple test case to TupleDomainParquetPredicateTest that should pass:

    @Test
    public void testAccentedString() throws ParquetCorruptionException {
        String column = "StringColumn";

        assertEquals(getDomain(createUnboundedVarcharType(), 10, stringColumnStats("Áncash", "china"), ID, column,
                true), create(ValueSet.ofRanges(range(createUnboundedVarcharType(), utf8Slice("Áncash"), true, utf8Slice("china"), true)), false));
    }

but it fails with:

com.facebook.presto.parquet.ParquetCorruptionException: Corrupted statistics for column "StringColumn" in Parquet file "testFile": [min: Áncash, max: china, num_nulls: 0]

Áncash sorts before china, but Presto flags the statistics as corrupt because it does not use natural ordering when comparing binary statistics. For context, the files that triggered this error were generated by Spark.
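One way a writer and a reader could disagree on this pair is if one side compares the UTF-8 bytes as signed values and the other as unsigned. The sketch below only illustrates that possibility. Attributing signed comparison to the writer and unsigned comparison to the reader is an assumption, not something the issue confirms, and `ByteOrderDemo` with its two helper methods is a hypothetical name, not Presto or Parquet code.

```java
import java.nio.charset.StandardCharsets;

final class ByteOrderDemo {
    // Lexicographic comparison treating each byte as signed (-128..127).
    static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Byte.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Lexicographic comparison treating each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        // "\u00C1" is the letter A with an acute accent; in UTF-8 it is the bytes 0xC3 0x81.
        byte[] min = "\u00C1ncash".getBytes(StandardCharsets.UTF_8);
        byte[] max = "china".getBytes(StandardCharsets.UTF_8);

        // Signed view: 0xC3 is -61, which is below 'c' (0x63 = 99), so min < max.
        System.out.println("signed:   " + Integer.signum(compareSigned(min, max)));
        // Unsigned view: 0xC3 is 195, which is above 99, so min > max.
        System.out.println("unsigned: " + Integer.signum(compareUnsigned(min, max)));
    }
}
```

Under these assumptions the signed comparison puts Áncash before china while the unsigned comparison puts it after, which would make a min/max pair written under one rule look inverted to a reader using the other.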

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 23 (8 by maintainers)

Top GitHub Comments

1 reaction
thoralf-gutierrez commented, Sep 4, 2019

ping @findepi – can someone from the Presto team take a look at this?

1 reaction
thoralf-gutierrez commented, Jun 15, 2019

We hit this problem as well, also with an irregular string as the minimum in the Parquet statistics.

If the reason for this is that Parquet and Presto use different comparison strategies, that sounds like pretty bad news to me.

Say we have strings A, B, and C, where the order for Parquet is A, B, C but the order for Presto is A, C, B. (I don’t mean to imply that Parquet’s order is correct while Presto’s isn’t; this is just an example.)

In the best case, say we have a Parquet page of Bs and Cs, the stats will show min=B and max=C and Presto will raise the corrupt stats exception.

In the worst case, say we have a Parquet page of As, Bs, and Cs, the stats will show min=A, max=C. Presto will not raise an exception as it also finds that A < C. But if Presto is looking for Bs, it will completely skip the page because it believes that B > C. No exception will be raised and Presto will return the wrong result.

Am I missing something?
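The two cases in the comment above can be sketched as a small simulation. This is only an illustration under the comment's own hypothetical orderings (writer: A, B, C; reader: A, C, B). `StatsOrderMismatch`, `readerDecision`, and the result strings are invented names for this sketch and do not correspond to Presto or Parquet APIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class StatsOrderMismatch {
    // Hypothetical orderings from the comment: the writer and reader disagree on B vs C.
    static final List<String> WRITER_ORDER = Arrays.asList("A", "B", "C");
    static final List<String> READER_ORDER = Arrays.asList("A", "C", "B");

    // Builds a comparator that ranks values by their position in the given list.
    static Comparator<String> byPosition(List<String> order) {
        return Comparator.comparingInt(order::indexOf);
    }

    // The writer computes min/max with its ordering; the reader then checks
    // those stats and the lookup value with its own ordering.
    static String readerDecision(List<String> page, String lookup) {
        Comparator<String> writer = byPosition(WRITER_ORDER);
        Comparator<String> reader = byPosition(READER_ORDER);
        String min = Collections.min(page, writer);
        String max = Collections.max(page, writer);
        if (reader.compare(min, max) > 0) {
            return "CORRUPT_STATS"; // min appears greater than max to the reader
        }
        boolean inRange = reader.compare(lookup, min) >= 0
                && reader.compare(lookup, max) <= 0;
        return inRange ? "READ_PAGE" : "SKIP_PAGE";
    }

    public static void main(String[] args) {
        // Best case: a page of Bs and Cs gives min=B, max=C, which the reader rejects.
        System.out.println(readerDecision(Arrays.asList("B", "C"), "B"));
        // Worst case: a page of As, Bs, and Cs gives min=A, max=C; the reader accepts
        // the stats but skips the page when looking for B, even though B is present.
        System.out.println(readerDecision(Arrays.asList("A", "B", "C"), "B"));
    }
}
```

In this sketch the first call reports corrupt statistics and the second silently skips a page that contains the value being searched for, which matches the best and worst cases described in the comment.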
