Presto doesn't seem to be able to read encrypted Parquet data
I am facing an issue where I am unable to query a Hive table whose data lives in S3. The data is encrypted with client-side encryption using a custom set of encryption materials. I can query the data in Hive (using the same encryption materials provider), but the same query fails in Presto with the exception below:
```
com.facebook.presto.spi.PrestoException: Error opening Hive split s3://bucket/data/year=2017/month=01/day=08/hour=01/part-r-00027.snappy.parquet (offset=0, length=40563366): null
at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:385)
at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.<init>(ParquetHiveRecordCursor.java:157)
at com.facebook.presto.hive.parquet.ParquetRecordCursorProvider.createRecordCursor(ParquetRecordCursorProvider.java:92)
at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:155)
at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:87)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:244)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:378)
at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:534)
at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:670)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:71)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.lambda$createParquetRecordReader$0(ParquetHiveRecordCursor.java:332)
at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:332)
... 15 more
```
I am using EMR. Details:
- EMR release label: emr-5.2.1
- Hive version: 2.1.0
- Presto version: 0.157.1
Any pointers?
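The `EOFException` points at Parquet's footer read. A Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`, and `ParquetFileReader.readFooter` seeks to `fileLength - 8` to read that tail. A likely explanation here (an inference, not something confirmed in this thread) is that with S3 client-side encryption the object length S3 reports is the ciphertext length, which is larger than the decrypted stream, so the reader seeks past the end of the plaintext and the very first 4-byte read hits end-of-file. Below is a minimal sketch of that tail read, simplified from what parquet-mr does (illustrative code, not the actual `readFooter` implementation):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Simplified illustration of the tail read that fails in the stack trace above.
public class ParquetTailCheck
{
    private static final String MAGIC = "PAR1";

    // Returns the offset at which the footer (file metadata) starts.
    public static long footerOffset(String path) throws IOException
    {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            // If this length overstates the readable bytes (e.g. ciphertext
            // length for an encrypted object), the reads below hit EOF.
            long length = file.length();
            // A Parquet file must end with: <4-byte little-endian footer length><"PAR1">
            file.seek(length - 8);
            // RandomAccessFile.readInt() is big-endian, so flip to little-endian;
            // this read corresponds to BytesUtils.readIntLittleEndian in the trace.
            int footerLength = Integer.reverseBytes(file.readInt());
            byte[] magic = new byte[4];
            file.readFully(magic); // throws EOFException if fewer than 4 bytes remain
            if (!MAGIC.equals(new String(magic, StandardCharsets.US_ASCII))) {
                throw new IOException("Not a Parquet file: bad trailing magic");
            }
            return length - 8 - footerLength;
        }
    }
}
```

Hive presumably succeeds because EMRFS knows the unencrypted content length for objects it wrote, while Presto's reader is handed the raw S3 object size; that is consistent with the header-writing workaround mentioned further down.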
As an FYI for other folks ending up in this thread: if you're using EMR, you can make Spark and Hadoop write this header when writing to S3 by adding this to the cluster configuration:
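A cluster configuration of this kind is a JSON list of classification/properties objects. A sketch of the shape only: the `emrfs-site` classification is where EMRFS properties live, and the property name below is a placeholder, not the exact key from this workaround:

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "<property that enables the unencrypted-content-length header>": "true"
    }
  }
]
```

The JSON can be supplied at cluster creation, for example with `aws emr create-cluster --configurations file://emrfs-config.json`, or pasted into the EMR console's software settings.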
Yep, please use the workaround mentioned above.
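For completeness, here is roughly how EMRFS client-side encryption with a custom materials provider is enabled on EMR, using the documented `emrfs-site` properties (`com.example.MyEncryptionMaterialsProvider` is a placeholder; your provider class must be on the cluster's classpath):

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider": "com.example.MyEncryptionMaterialsProvider"
    }
  }
]
```

Note that this configures EMRFS, which Hive uses; Presto's Hive connector drives its own S3 client, where (in later Presto versions, if I recall the property name correctly) `hive.s3.encryption-materials-provider` is the corresponding knob.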