
[FIXED] Presto cannot query hudi table

See original GitHub issue

Describe the problem you faced

I made a non-partitioned Hudi table using Spark. I was able to query it with Spark & Hive, but when I tried querying it with Presto, I received the error Could not find partitionDepth in partition metafile.

To Reproduce

Steps to reproduce the behavior:

  1. Use an emr-5.28.0 cluster
  2. Run spark shell:
spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --deploy-mode client
  3. Run Spark code:
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.hive._
import org.apache.hudi.keygen.NonpartitionedKeyGenerator

val inputPath = "s3://path/to/a/parquet/file"
val tableName = "my_test_table"
val basePath = "s3://test-bucket/my_test_table" 

val inputDf = spark.read.parquet(inputPath)

val hudiOptions = Map[String,String](
    RECORDKEY_FIELD_OPT_KEY -> "dim_advertiser_id",
    PRECOMBINE_FIELD_OPT_KEY -> "update_time",
    TABLE_NAME -> tableName,
    KEYGENERATOR_CLASS_OPT_KEY -> classOf[NonpartitionedKeyGenerator].getCanonicalName, // needed for non-partitioned table
    HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[NonPartitionedExtractor].getCanonicalName, // needed for non-partitioned table
    OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
    HIVE_SYNC_ENABLED_OPT_KEY -> "true",
    HIVE_TABLE_OPT_KEY -> tableName,
    TABLE_TYPE_OPT_KEY -> COW_TABLE_TYPE_OPT_VAL,
    "hoodie.bulkinsert.shuffle.parallelism" -> "10")

inputDf.write.format("org.apache.hudi").
    options(hudiOptions).
    mode(Overwrite).
    save(basePath)
  4. Querying the table in Spark or Hive works (a read-back sketch follows the stack trace below)
  5. Querying the table in Presto fails:
[hadoop@ip-172-31-128-118 ~]$ presto-cli --catalog hive --schema default
presto:default> select count(*) from my_test_table;

Query 20200211_185123_00018_pruwt, FAILED, 1 node
Splits: 17 total, 0 done (0.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20200211_185123_00018_pruwt failed: Could not find partitionDepth in partition metafile
com.facebook.presto.spi.PrestoException: Could not find partitionDepth in partition metafile
  at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:200)
  at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
  at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
  at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
  at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: Could not find partitionDepth in partition metafile
  at org.apache.hudi.common.model.HoodiePartitionMetadata.getPartitionDepth(HoodiePartitionMetadata.java:75)
  at org.apache.hudi.hadoop.HoodieParquetInputFormat.getTableMetaClient(HoodieParquetInputFormat.java:209)
  at org.apache.hudi.hadoop.HoodieParquetInputFormat.groupFileStatus(HoodieParquetInputFormat.java:158)
  at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:69)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
  at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:371)
  at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:264)
  at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:96)
  at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:193)
  ... 7 more
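
For completeness, a minimal read-back check that can be run in the same spark-shell session right after the write, before moving on to Hive or Presto. This is a hedged sketch: it assumes the inputDf and basePath values defined above, and the glob-style load path mirrors the pattern the reporter uses later in this issue for the partitioned variant (for a non-partitioned layout, load(basePath) may be needed instead).

// Hedged read-back check, assuming the same spark-shell session and the
// variables (inputDf, basePath) defined in the repro above.
val readBackDf = spark.read.format("org.apache.hudi").load(basePath + "/*")

// Compare row counts between the source DataFrame and the Hudi read-back.
println(s"written=${inputDf.count()} readBack=${readBackDf.count()}")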

Expected behavior

Presto should return a count of all the rows. Other Presto queries should succeed.

Environment Description

  • EMR version: emr-5.28.0
  • Hudi version: 0.5.1-incubating, 0.5.0-incubating
  • Spark version: 2.4.4
  • Hive version: 2.3.6
  • Hadoop version: 2.8.5
  • Presto version: 0.227
  • Storage (HDFS/S3/GCS…): S3
  • Running on Docker? (yes/no): no

Stacktrace

Included in “Steps to reproduce”.

Additional Info

When I used one of the columns as a partition column, I was able to query the table in Spark using spark.read.format("org.apache.hudi").load(basePath + "/*"). However, querying it in Hive resulted in:

Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1580774559033_0082_2_00, diagnostics=[Vertex vertex_1580774559033_0082_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: my_test_table initializer failed, vertex=vertex_1580774559033_0082_2_00 [Map 1], java.lang.NullPointerException
        at org.apache.hudi.hadoop.HoodieHiveUtil.getNthParent(HoodieHiveUtil.java:66)
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.getTableMetaClient(HoodieParquetInputFormat.java:313)
        at org.apache.hudi.hadoop.InputPathHandler.parseInputPaths(InputPathHandler.java:98)
        at org.apache.hudi.hadoop.InputPathHandler.<init>(InputPathHandler.java:58)
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:71)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1580774559033_0082_2_01, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1580774559033_0082_2_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1580774559033_0082_2_00, diagnostics=[Vertex vertex_1580774559033_0082_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: my_test_table initializer failed, vertex=vertex_1580774559033_0082_2_00 [Map 1], java.lang.NullPointerException
        at org.apache.hudi.hadoop.HoodieHiveUtil.getNthParent(HoodieHiveUtil.java:66)
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.getTableMetaClient(HoodieParquetInputFormat.java:313)
        at org.apache.hudi.hadoop.InputPathHandler.parseInputPaths(InputPathHandler.java:98)
        at org.apache.hudi.hadoop.InputPathHandler.<init>(InputPathHandler.java:58)
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:71)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
]Vertex killed, vertexName=Reducer 2, vertexId=vertex_1580774559033_0082_2_01, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1580774559033_0082_2_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1

Querying it in presto-cli returned 0 rows.
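
The issue does not show the exact options used for this partitioned-column experiment. A hypothetical reconstruction, assuming the only changes were dropping the two non-partitioned-specific options and adding a partition field, might look like the following; "partition_col" is a placeholder column name, and the option keys come from the DataSourceWriteOptions and org.apache.hudi.hive imports already used in the repro, but the exact combination the reporter used is an assumption.

// Hypothetical options for the partitioned variant described above.
// "partition_col" is a placeholder; the exact combination is an assumption.
val partitionedHudiOptions = Map[String, String](
    RECORDKEY_FIELD_OPT_KEY -> "dim_advertiser_id",
    PRECOMBINE_FIELD_OPT_KEY -> "update_time",
    PARTITIONPATH_FIELD_OPT_KEY -> "partition_col", // default key generator; NonpartitionedKeyGenerator no longer needed
    TABLE_NAME -> tableName,
    OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
    HIVE_SYNC_ENABLED_OPT_KEY -> "true",
    HIVE_TABLE_OPT_KEY -> tableName,
    HIVE_PARTITION_FIELDS_OPT_KEY -> "partition_col",
    HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getCanonicalName,
    TABLE_TYPE_OPT_KEY -> COW_TABLE_TYPE_OPT_VAL,
    "hoodie.bulkinsert.shuffle.parallelism" -> "10")

inputDf.write.format("org.apache.hudi").
    options(partitionedHudiOptions).
    mode(Overwrite).
    save(basePath)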

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
popart commented on Feb 26, 2020

I found the problem. We had client-side encryption configured for Spark & Hive using EMRFS, but not for Presto.
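
The exception above comes from Hudi failing to read partitionDepth out of the .hoodie_partition_metadata file (the file name comes up again in the next comment). A hedged way to check whether a given host can actually read, and with EMRFS client-side encryption decrypt, that file is to load it directly from spark-shell; the properties-style layout and the partitionDepth key are assumptions inferred from the error message.

// Hedged sanity check: can this host read (and decrypt) the partition
// metafile that Hudi parses for "partitionDepth"? For the non-partitioned
// table here it should sit directly under basePath; adjust the path for
// partitioned layouts. The properties-style format is an assumption.
import java.util.Properties
import org.apache.hadoop.fs.Path

val metaPath = new Path(basePath + "/.hoodie_partition_metadata")
val fs = metaPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

val in = fs.open(metaPath)
try {
  val props = new Properties()
  props.load(in)
  println(s"partitionDepth=${props.getProperty("partitionDepth")}")
} finally {
  in.close()
}

If this read succeeds from the Spark side but the equivalent access fails (or returns ciphertext) from a host configured like the Presto workers, that lines up with the encryption mismatch described above.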

0 reactions
popart commented on Feb 24, 2020

Update: this problem does not occur in the Docker demo environment. There I was able to create a non-partitioned table in Spark (saved to HDFS), use run_sync_tool.sh to sync it to Hive, and then query it successfully from Presto. (It still created the .hoodie_partition_metadata file, though.)
