
How to read from a Hudi input?


I have:

inputs:
  mydf:
    file:
      path: s3a://xx/a/b/c/

There are partition folders under the s3a://xx/a/b/c/ path, and there are Hudi parquet files under them.

I want mydf to get the partition columns in the DataFrame too.

I get:

2020-03-25 16:20:27,070 [main] INFO  org.apache.spark.scheduler.DAGScheduler - Job 1 finished: load at FilesInput.scala:29, took 0.073979 s
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
        at com.yotpo.metorikku.input.readers.file.FilesInput.read(FilesInput.scala:29)
        at com.yotpo.metorikku.input.readers.file.FileInput.read(FileInput.scala:15)
        at com.yotpo.metorikku.Job$$anonfun$registerDataframes$1.apply(Job.scala:68)
        at com.yotpo.metorikku.Job$$anonfun$registerDataframes$1.apply(Job.scala:66)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at com.yotpo.metorikku.Job.registerDataframes(Job.scala:66)
        at com.yotpo.metorikku.Job.<init>(Job.scala:48)
        at com.yotpo.metorikku.Metorikku$.delayedEndpoint$com$yotpo$metorikku$Metorikku$1(Metorikku.scala:10)
        at com.yotpo.metorikku.Metorikku$delayedInit$body.apply(Metorikku.scala:7)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
        at scala.App$class.main(App.scala:76)
        at com.yotpo.metorikku.Metorikku$.main(Metorikku.scala:7)
        at com.yotpo.metorikku.Metorikku.main(Metorikku.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2020-03-25 16:20:27,077 [pool-1-thread-1] INFO  org.apache.spark.SparkContext - Invoking stop() from shutdown hook

I also tried format: com.uber.hoodie and got:

2020-03-25 16:36:07,067 [main] INFO  com.yotpo.metorikku.Job - Registering mydf table
Exception in thread "main" com.uber.hoodie.exception.HoodieException: 'path' must be specified.
        at com.uber.hoodie.DefaultSource.createRelation(DefaultSource.scala:57)
        at com.uber.hoodie.DefaultSource.createRelation(DefaultSource.scala:46)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
        at com.yotpo.metorikku.input.readers.file.FilesInput.read(FilesInput.scala:29)
        at com.yotpo.metorikku.input.readers.file.FileInput.read(FileInput.scala:15)
        at com.yotpo.metorikku.Job$$anonfun$registerDataframes$1.apply(Job.scala:68)
        at com.yotpo.metorikku.Job$$anonfun$registerDataframes$1.apply(Job.scala:66)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at com.yotpo.metorikku.Job.registerDataframes(Job.scala:66)
        at com.yotpo.metorikku.Job.<init>(Job.scala:48)
        at com.yotpo.metorikku.Metorikku$.delayedEndpoint$com$yotpo$metorikku$Metorikku$1(Metorikku.scala:10)
        at com.yotpo.metorikku.Metorikku$delayedInit$body.apply(Metorikku.scala:7)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
        at scala.App$class.main(App.scala:76)
        at com.yotpo.metorikku.Metorikku$.main(Metorikku.scala:7)
        at com.yotpo.metorikku.Metorikku.main(Metorikku.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I also tried --conf spark.sql.hive.convertMetastoreParquet=false, with the same error.

I have --jars "/home/ec2-user/hoodie-spark-bundle-0.4.6.jar".

These are COW tables. https://hudi.incubator.apache.org/docs/querying_data.html mentions:

spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);

I'm not sure how to set that in Metorikku, though.
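
For reference, this is roughly how that setting is applied in a plain Spark job, outside Metorikku (a sketch only; with the hoodie-spark-bundle-0.4.6.jar above, the filter class lives under com.uber.hoodie.hadoop rather than org.apache.hudi.hadoop):

import org.apache.hadoop.fs.PathFilter
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-ro-read").getOrCreate()

// Tell Hadoop's input layer to skip all but the latest Hudi file versions
// (the read-optimized view of a COW table). With bundle 0.4.x the class is
// com.uber.hoodie.hadoop.HoodieROTablePathFilter; newer releases use
// org.apache.hudi.hadoop.HoodieROTablePathFilter.
spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
  classOf[PathFilter])

Any parquet read done through that session afterwards goes through the filter; the spark.hadoop.mapreduce.input.pathFilter.class conf mentioned in the comments below is the spark-submit way of setting the same thing.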

If I use s3a://xx/a/b/c/*/*/*.parquet as the path it seems to get further, but I'm not sure: a) if that's the right approach, b) whether the partition columns will be there, c) whether there will be dupes in the data?
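
On b), a plain parquet read over a glob drops the partition columns unless Spark is told where the table root is; Spark's basePath option controls that. A minimal sketch, using the placeholder paths from above:

// Glob over the leaf files but keep partition discovery rooted at the table path,
// so the partition directory names come back as columns in the DataFrame.
val df = spark.read
  .option("basePath", "s3a://xx/a/b/c")
  .parquet("s3a://xx/a/b/c/*/*/*.parquet")

On c), a raw glob can also pick up older versions of the Hudi data files in each partition, so duplicates are possible; filtering those out is exactly what the HoodieROTablePathFilter above is for.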


Top GitHub Comments

1 reaction
lyogev commented, Apr 7, 2020

What’s the error you’re getting? You may need to do something like: **/*.parquet in the path

0 reactions
tooptoop4 commented, May 5, 2020

--conf spark.hadoop.mapreduce.input.pathFilter.class=com.uber.hoodie.hadoop.HoodieROTablePathFilter works, but it means all inputs (even some non-Hudi ones) get the path filter class. Do you think Metorikku needs a new option, like a pathFilterClass per input? Hudi support seems to be a key feature of Metorikku.
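
Something like the following, purely as a sketch of the idea (pathFilterClass is a hypothetical option name, not something Metorikku supports today):

# Hypothetical per-input option -- not an existing Metorikku feature
inputs:
  mydf:
    file:
      path: s3a://xx/a/b/c/
      pathFilterClass: com.uber.hoodie.hadoop.HoodieROTablePathFilter  # applied only to this input
  plaindf:
    file:
      path: s3a://xx/other/parquet/  # non-Hudi input, no filter applied

That would keep the filter off plain parquet inputs while still giving the read-optimized view of the Hudi ones.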

