Incorrect comparison of Parquet binary statistics for accented characters

On version 0.216, Presto incorrectly concludes that a binary column's statistics are corrupt because of how it orders accented values. The root cause is probably the naive comparison made by the Slice library here: https://github.com/prestodb/presto/blob/master/presto-parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java#L201

I have added a simple test case to TupleDomainParquetPredicateTest that should not fail:

    @Test
    public void testAccentedString() throws ParquetCorruptionException {
        String column = "StringColumn";

        assertEquals(getDomain(createUnboundedVarcharType(), 10, stringColumnStats("Áncash", "china"), ID, column,
                true), create(ValueSet.ofRanges(range(createUnboundedVarcharType(), utf8Slice("Áncash"), true, utf8Slice("china"), true)), false));
    }

but it fails with:

com.facebook.presto.parquet.ParquetCorruptionException: Corrupted statistics for column "StringColumn" in Parquet file "testFile": [min: Áncash, max: china, num_nulls: 0]

Áncash comes before china in natural ordering, but Presto flags the statistics as corrupt because it does not use natural ordering when comparing binary statistics. As additional information, the files that led me to this error were generated by Spark.
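To make the ordering disagreement concrete, here is a minimal sketch that compares the UTF-8 bytes of the two values from the test case in two ways. It assumes (this is not confirmed in the report) that the writer ordered bytes as signed values while the reader orders them as unsigned values. `ByteOrderDemo`, `compareBytes`, and `utf8` are hypothetical names used only for illustration; they are not Presto or Parquet APIs.

```java
import java.nio.charset.StandardCharsets;

final class ByteOrderDemo {
    // Lexicographic comparison of two byte arrays. When `unsigned` is true,
    // each byte is widened to 0..255; otherwise Java's signed range
    // -128..127 is used. (Hypothetical helper, not a library call.)
    static int compareBytes(byte[] a, byte[] b, boolean unsigned) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = unsigned ? (a[i] & 0xFF) : a[i];
            int y = unsigned ? (b[i] & 0xFF) : b[i];
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00C1" is the accented capital A; in UTF-8 it is 0xC3 0x81.
        byte[] min = utf8("\u00C1ncash");
        byte[] max = utf8("china");

        // Signed view: 0xC3 is -61, below 'c' (99), so min sorts before max.
        System.out.println("signed:   " + Integer.signum(compareBytes(min, max, false)));
        // Unsigned view: 0xC3 is 195, above 'c' (99), so min sorts after max.
        System.out.println("unsigned: " + Integer.signum(compareBytes(min, max, true)));
    }
}
```

Under these assumptions the same pair of values is a valid min/max range in one ordering and an inverted range in the other, which would be consistent with the "Corrupted statistics" exception above.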

Issue Analytics

Created 5 years ago
Reactions: 3
Comments: 23 (8 by maintainers)

Top GitHub Comments

1 reaction

thoralf-gutierrez commented, Sep 4, 2019

ping @findepi – can someone from the Presto team take a look at this?

1 reaction

thoralf-gutierrez commented, Jun 15, 2019

We hit this problem as well, also with an irregular string as the minimum in the Parquet statistics.

If the reason for this is that the comparison strategies of Parquet and Presto are different, that sounds like pretty bad news to me.

Say we have strings A, B, C, and the order for Parquet is A, B, C but for Presto it is A, C, B. (I don’t mean to imply that Parquet’s order is correct, while Presto’s isn’t – this is just an example)

In the best case, say we have a Parquet page of Bs and Cs, the stats will show min=B and max=C and Presto will raise the corrupt stats exception.

In the worst case, say we have a Parquet page of As, Bs, and Cs, the stats will show min=A, max=C. Presto will not raise an exception as it also finds that A < C. But if Presto is looking for Bs, it will completely skip the page because it believes that B > C. No exception will be raised and Presto will return the wrong result.

Am I missing something?
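The two cases above can be sketched with concrete strings. This is only an illustration under stated assumptions: the writer is assumed to compute min/max with signed byte ordering and the reader to evaluate them with unsigned byte ordering, and `StatsOrderingMismatch`, `lookup`, and `Outcome` are hypothetical names, not Presto code. In the silent case shown here the reader places the wanted value below the minimum rather than above the maximum, but the failure mode is the same.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class StatsOrderingMismatch {
    enum Outcome { CORRUPT_STATS, PAGE_SKIPPED, PAGE_READ }

    // Assumed writer ordering: lexicographic over signed bytes.
    static final Comparator<String> WRITER_ORDER =
            (a, b) -> compareBytes(utf8(a), utf8(b), false);
    // Assumed reader ordering: lexicographic over unsigned bytes.
    static final Comparator<String> READER_ORDER =
            (a, b) -> compareBytes(utf8(a), utf8(b), true);

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static int compareBytes(byte[] a, byte[] b, boolean unsigned) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = unsigned ? (a[i] & 0xFF) : a[i];
            int y = unsigned ? (b[i] & 0xFF) : b[i];
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Statistics are computed with the writer's ordering and then
    // interpreted with the reader's ordering.
    static Outcome lookup(List<String> page, String wanted) {
        String min = Collections.min(page, WRITER_ORDER);
        String max = Collections.max(page, WRITER_ORDER);
        if (READER_ORDER.compare(min, max) > 0) {
            return Outcome.CORRUPT_STATS; // reader sees min > max
        }
        if (READER_ORDER.compare(wanted, min) < 0
                || READER_ORDER.compare(wanted, max) > 0) {
            return Outcome.PAGE_SKIPPED; // reader thinks the value cannot be here
        }
        return Outcome.PAGE_READ;
    }

    public static void main(String[] args) {
        // Best case: the disagreement is loud.
        System.out.println(lookup(Arrays.asList("\u00C1ncash", "china"), "china"));
        // Worst case: the page contains "ab", but the reader skips it.
        System.out.println(lookup(Arrays.asList("a\u00C1", "ab", "b"), "ab"));
    }
}
```

With these inputs the first lookup reports corrupt statistics, while the second skips a page that does contain the wanted value, without raising any error.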
