
Presto doesn't seem to be able to read encrypted Parquet data

See original GitHub issue

I am facing an issue where I am unable to query a Hive table whose data lives in S3. The data is encrypted with client-side encryption using a custom encryption materials provider. I am able to query the data in Hive (using the same encryption materials provider), but the same query fails in Presto with the exception below:

com.facebook.presto.spi.PrestoException: Error opening Hive split s3://bucket/data/year=2017/month=01/day=08/hour=01/part-r-00027.snappy.parquet (offset=0, length=40563366): null
        at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:385)
        at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.<init>(ParquetHiveRecordCursor.java:157)
        at com.facebook.presto.hive.parquet.ParquetRecordCursorProvider.createRecordCursor(ParquetRecordCursorProvider.java:92)
        at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:155)
        at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:87)
        at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
        at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
        at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:244)
        at com.facebook.presto.operator.Driver.processInternal(Driver.java:378)
        at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
        at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
        at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:534)
        at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:670)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:71)
        at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
        at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
        at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.lambda$createParquetRecordReader$0(ParquetHiveRecordCursor.java:332)
        at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
        at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
        at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:332)
        ... 15 more
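
A plausible reading of this trace (a sketch, not a definitive diagnosis): Parquet readers locate the file footer relative to the reported file length, seeking to length - 8 and reading a 4-byte little-endian footer length followed by the 4-byte "PAR1" magic. With S3 client-side encryption the stored object is the ciphertext, which is slightly longer than the plaintext; if the file system reports that ciphertext length while serving decrypted plaintext, the seek lands past the end of the stream and the 4-byte read fails, which matches the EOFException in BytesUtils.readIntLittleEndian above. The class below is a hypothetical illustration of that footer-location logic, not the actual Presto or parquet-mr code; the SeekableBytes interface is an invented stand-in for Hadoop's FSDataInputStream.

import java.io.EOFException;
import java.io.IOException;

// Minimal sketch (not Presto/parquet-mr code) of how a Parquet reader locates
// the footer. The last 8 bytes of a Parquet file are a 4-byte little-endian
// footer length followed by the 4-byte magic "PAR1", so everything is computed
// from the *reported* file length.
public class ParquetFooterSketch
{
    // Returns the byte offset at which the footer starts, given the length the
    // file system reports and a stream positioned over the (decrypted) data.
    static long findFooterOffset(SeekableBytes file, long reportedLength) throws IOException
    {
        long lengthAndMagicStart = reportedLength - 8;
        file.seek(lengthAndMagicStart);
        int footerLength = readIntLittleEndian(file); // fails here if reportedLength is too large
        byte[] magic = new byte[4];
        file.readFully(magic);                        // magic/range validation omitted
        return lengthAndMagicStart - footerLength;
    }

    // Reads 4 bytes little-endian; throws EOFException if the stream ends early,
    // which is the failure mode shown in the stack trace above.
    static int readIntLittleEndian(SeekableBytes file) throws IOException
    {
        int b0 = file.read();
        int b1 = file.read();
        int b2 = file.read();
        int b3 = file.read();
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new EOFException("reached end of stream before the Parquet footer");
        }
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    // Hypothetical stand-in for Hadoop's FSDataInputStream.
    interface SeekableBytes
    {
        void seek(long position) throws IOException;
        int read() throws IOException;                // returns -1 at end of stream
        void readFully(byte[] buffer) throws IOException;
    }
}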

I am using EMR. Details:

  • EMR release label: emr-5.2.1
  • Hive version: 2.1.0
  • Presto version: 0.157.1

Any pointers?

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
cjangrist commented, Nov 22, 2017

As an FYI for other folks ending up in this thread: if you’re using EMR, you can make Spark and Hadoop write this header when writing to S3 by adding this to the cluster configuration:

  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3n.multipart.uploads.enabled": "false"
    }
  }
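
For reference, EMR expects its configuration JSON as an array of classification objects, so a complete configurations file might look like the sketch below (the file name and CLI invocation are illustrative, not from this thread):

  [
    {
      "Classification": "emrfs-site",
      "Properties": {
        "fs.s3n.multipart.uploads.enabled": "false"
      }
    }
  ]

This could be supplied at cluster creation, for example with aws emr create-cluster --configurations file://emrfs-config.json, or pasted into the software settings when creating the cluster in the console.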

0 reactions
nezihyigitbasi commented, Jul 5, 2017

Yep, please use the workaround mentioned above.

Read more comments on GitHub >
