Presto doesn't seem to be able to read encrypted Parquet data
I am facing an issue where I am unable to query a Hive table whose data lives in S3. The data is encrypted with client-side encryption using a custom set of encryption materials. I can query the data in Hive (using the same encryption materials provider), but the same query fails in Presto with the exception below:
```
com.facebook.presto.spi.PrestoException: Error opening Hive split s3://bucket/data/year=2017/month=01/day=08/hour=01/part-r-00027.snappy.parquet (offset=0, length=40563366): null
at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:385)
at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.<init>(ParquetHiveRecordCursor.java:157)
at com.facebook.presto.hive.parquet.ParquetRecordCursorProvider.createRecordCursor(ParquetRecordCursorProvider.java:92)
at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:155)
at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:87)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:244)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:378)
at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:534)
at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:670)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:71)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.lambda$createParquetRecordReader$0(ParquetHiveRecordCursor.java:332)
at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:332)
... 15 more
```
I am using EMR. Details:
- EMR release label: emr-5.2.1
- Hive version: 2.1.0
- Presto version: 0.157.1
Any pointers?
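The `EOFException` points at Parquet's footer read. A Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`, and `ParquetFileReader.readFooter` seeks to `fileLength - 8` to read that tail. A likely explanation here (an inference, not something confirmed in this thread) is that with S3 client-side encryption the object length S3 reports is the ciphertext length, which is larger than the decrypted stream, so the reader seeks past the end of the plaintext and the very first 4-byte read hits end-of-file. Below is a minimal sketch of that tail read, simplified from what parquet-mr does (illustrative code, not the actual `readFooter` implementation):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Simplified illustration of the tail read that fails in the stack trace above.
public class ParquetTailCheck
{
    private static final String MAGIC = "PAR1";

    // Returns the offset at which the footer (file metadata) starts.
    public static long footerOffset(String path) throws IOException
    {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            // If this length overstates the readable bytes (e.g. ciphertext
            // length for an encrypted object), the reads below hit EOF.
            long length = file.length();
            // A Parquet file must end with: <4-byte little-endian footer length><"PAR1">
            file.seek(length - 8);
            // RandomAccessFile.readInt() is big-endian, so flip to little-endian;
            // this read corresponds to BytesUtils.readIntLittleEndian in the trace.
            int footerLength = Integer.reverseBytes(file.readInt());
            byte[] magic = new byte[4];
            file.readFully(magic); // throws EOFException if fewer than 4 bytes remain
            if (!MAGIC.equals(new String(magic, StandardCharsets.US_ASCII))) {
                throw new IOException("Not a Parquet file: bad trailing magic");
            }
            return length - 8 - footerLength;
        }
    }
}
```

Hive presumably succeeds because EMRFS knows the unencrypted content length for objects it wrote, while Presto's reader is handed the raw S3 object size; that is consistent with the header-writing workaround mentioned further down.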
As an FYI for other folks ending up in this thread: if you're using EMR, you can make Spark and Hadoop write this header when writing to S3 by adding this to the cluster configuration:
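A cluster configuration of this kind is a JSON list of classification/properties objects. A sketch of the shape only: the `emrfs-site` classification is where EMRFS properties live, and the property name below is a placeholder, not the exact key from this workaround:

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "<property that enables the unencrypted-content-length header>": "true"
    }
  }
]
```

The JSON can be supplied at cluster creation, for example with `aws emr create-cluster --configurations file://emrfs-config.json`, or pasted into the EMR console's software settings.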
Yep, please use the workaround mentioned above.
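For completeness, here is roughly how EMRFS client-side encryption with a custom materials provider is enabled on EMR, using the documented `emrfs-site` properties (`com.example.MyEncryptionMaterialsProvider` is a placeholder; your provider class must be on the cluster's classpath):

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider": "com.example.MyEncryptionMaterialsProvider"
    }
  }
]
```

Note that this configures EMRFS, which Hive uses; Presto's Hive connector drives its own S3 client, where (in later Presto versions, if I recall the property name correctly) `hive.s3.encryption-materials-provider` is the corresponding knob.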