
[SUPPORT] EMR 6.0 hudi-spark-bundle 0.6.0 is not able to query data, two problems


Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

First problem

scala> val df2 = spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/york_test/york_compaction_test2/*")
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
  at org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildFileIndex$1(MergeOnReadSnapshotRelation.scala:145)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:142)
  at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:87)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:51)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
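
The NoSuchMethodError indicates that the PartitionedFile constructor Hudi 0.6.0 was built against does not exist in the spark-sql jar on the classpath. As a purely diagnostic sketch (run in spark-shell on the cluster), you could check which jar the class is actually loaded from:

// Diagnostic only: print the jar that supplies PartitionedFile,
// to see whether it is the stock Spark 2.4.4 artifact or a vendor-patched one.
import org.apache.spark.sql.execution.datasources.PartitionedFile
println(classOf[PartitionedFile].getProtectionDomain.getCodeSource.getLocation)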

Second Problem

A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.ClassLoader.defineClass1(Native Method)

When I checked org.apache.hudi.hadoop.HoodieParquetInputFormat, the following imports are not resolved in IntelliJ:

import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
import org.apache.hadoop.hive.ql.io.sarg.ConvertAstToSearchArg;
import org.apache.hadoop.hive.ql.plan.TableScanDesc;
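
Both the NoClassDefFoundError and the unresolved imports point at classes that ship in Hive's hive-exec artifact. As a quick diagnostic sketch (spark-shell or the test JVM), you could check whether the class is visible at runtime and which jar provides it:

// Diagnostic only: Class.forName throws ClassNotFoundException if hive-exec is not on the classpath.
val hiveCls = Class.forName("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat")
println(hiveCls.getProtectionDomain.getCodeSource.getLocation)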

To Reproduce

Steps to reproduce the behavior of the first problem:

spark-shell --master yarn \
  --executor-cores 2 \
  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.6.0,org.apache.spark:spark-avro_2.12:2.4.4,org.apache.avro:avro:1.8.2 \
  --driver-memory 1G \
  --executor-memory 1G \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.dynamicAllocation.minExecutors=1" \
  --conf "spark.dynamicAllocation.maxExecutors=1" \
  --conf "spark.executor.instances=1" \
  --conf spark.sql.hive.convertMetastoreParquet=false
import java.sql.Timestamp
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.model.HoodieCleaningPolicy
import org.apache.hudi.config.HoodieCompactionConfig._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import spark.implicits._


val tableName = "york_compaction_test2"
val hudiOptions = Map[String,String](
      TABLE_NAME -> tableName,
      PRECOMBINE_FIELD_OPT_KEY -> "timestamp_column",
      RECORDKEY_FIELD_OPT_KEY -> "id",
      PARTITIONPATH_FIELD_OPT_KEY -> "date_string",
      TABLE_TYPE_OPT_KEY -> MOR_TABLE_TYPE_OPT_VAL,
      CLEANER_INCREMENTAL_MODE -> "true",
      CLEANER_POLICY_PROP -> HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name(),
      CLEANER_FILE_VERSIONS_RETAINED_PROP -> "2",
      "hoodie.compaction.strategy" -> "org.apache.hudi.table.compact.strategy.UnBoundedCompactionStrategy",
      "hoodie.upsert.shuffle.parallelism" -> "50",
      "hoodie.compact.inline" -> "true",
      "hoodie.compact.inline.max.delta.commits" -> "5",
      "hoodie.filesystem.view.incr.timeline.sync.enable" -> "true",
      HIVE_URL_OPT_KEY -> "jdbc:hive2://ip-10-221-36-98.ap-northeast-1.compute.internal:10000",
      HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
      HIVE_DATABASE_OPT_KEY -> "default", 
      HIVE_TABLE_OPT_KEY -> tableName,
      HIVE_PARTITION_FIELDS_OPT_KEY -> "date_string",
      HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
    )
val dateSMap: Map[Int, String] = Map(
    0-> "2020-07",
    1-> "2020-08",
    2-> "2020-09",
)
val dateMap: Map[Int, Timestamp] = Map(
    0-> Timestamp.valueOf("2010-07-01 11:00:15"),
    1-> Timestamp.valueOf("2010-08-01 11:00:15"),
    2-> Timestamp.valueOf("2010-09-01 11:00:15"),
)
var seq = Seq(
    (0, "value", dateMap(0), dateSMap(0))
)
for(i <- 1 to 5000) {
    seq :+= (i, "value", dateMap(i % 3), dateSMap(i % 3))
}

val df = seq.toDF("id", "string_column", "timestamp_column", "date_string").withColumn("timestamp_column", date_format(col("timestamp_column"), "yyyy-MM-dd HH:mm:ss"))

df.write.format("hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(s"s3a://daas-hudi-test/york_test/$tableName")

val df2 = spark.read.format("org.apache.hudi").load("s3://daas-hudi-test/york_test/york_compaction_test2/*")

Steps to reproduce the behavior of the second problem:

  1. Build definition (build.sbt)
val sparkVersion = "2.4.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,

  // Needed for Hudi
  // NB: we use a version of Hudi that is published locally and that contains a patch which fixes compaction:
  // https://github.com/apache/hudi/pull/1857
  "org.apache.hudi" %% "hudi-spark-bundle" % "0.6.0" % Provided,
  "org.apache.avro" % "avro" % "1.8.2" % Provided,
  "org.apache.spark" %% "spark-avro" % sparkVersion % Provided,
)
  2. Code
val tblName: String = "dummy"
  val viewName:String = tblName + "_view"
  val primaryKey: String = "id"
  val preCombineKey: String = Constants.TS_FIELD
  val partitionField: String = "age"
  val basePath: String = "/tmp/hudi-tests/"
  val hudiOptions: Map[String, String] = Map(
    TABLE_NAME -> tblName,
    TABLE_TYPE_OPT_KEY -> MOR_TABLE_TYPE_OPT_VAL,
    RECORDKEY_FIELD_OPT_KEY -> primaryKey,
    PRECOMBINE_FIELD_OPT_KEY -> preCombineKey,
    PARTITIONPATH_FIELD_OPT_KEY -> partitionField,
    "hoodie.insert.shuffle.parallelism" -> "1",
    "hoodie.bulkinsert.shuffle.parallelism" -> "1",
    "hoodie.upsert.shuffle.parallelism" -> "1",
    "hoodie.datasource.compaction.async.enable" -> "true",
    "hoodie.compact.inline.max.delta.commits" -> "1"
  )

   spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.shuffle.partitions", "1")
      .config("spark.ui.enabled", "false")
      .appName("hudi-processing-test")
      .getOrCreate()
    spark.sparkContext.setLogLevel("warn")

    import spark.implicits._

    val df = Seq (
      (1, "Bob", 10, Timestamp.valueOf("2020-06-27 00:00:00"), 100.toLong, BinlogEventType.Insert.value),
      (1, "BobTheBuilder", 10, Timestamp.valueOf("2020-06-27 00:00:00"), 101.toLong, BinlogEventType.Update.value)
    ).toDF("id", "name", "age", "created_at", Constants.TS_FIELD, Constants.EVENT_TYPE_FIELD)

    df
      .write
      .format("hudi")
      .options(hudiOptions + ((PAYLOAD_CLASS_OPT_KEY, EventsWriter.DEFAULT_PAYLOAD_CLASS)))
      .mode(SaveMode.Append)
      .save(basePath)

    spark.read
      .format("hudi")
      .load(basePath + "/*/")
      .createOrReplaceTempView(viewName)

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version : 0.6.0

  • Spark version : 2.4.4

  • Hive version : 3.1.2

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : NO

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

2 reactions
zherenyu831 commented, Aug 31, 2020

@bvaradar I found the reason. We are using EMR 6.0, and in its spark-sql_2.12-2.4.4 jar the interface of org.apache.spark.sql.execution.datasources.PartitionedFile is different from the official one.

EMR

public PartitionedFile(InternalRow partitionValues, String filePath, long start, long length, String[] locations, FileData data)

Official

public PartitionedFile(InternalRow partitionValues, String filePath, long start, long length, String[] locations)
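
To confirm this on a given cluster, one diagnostic option (sketch only) is to list the available constructors via reflection in spark-shell:

// Diagnostic only: print every PartitionedFile constructor; on EMR 6.0 a variant with an
// extra trailing parameter shows up, which is what breaks Hudi's call.
import org.apache.spark.sql.execution.datasources.PartitionedFile
classOf[PartitionedFile].getConstructors.foreach(println)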

A workaround is:

  ShadeRule.rename("org.apache.spark.sql.execution.datasources.**" -> "local.spark.@1")
    .inLibrary(
      "org.apache.hudi" % "hudi-spark-bundle_2.12" % "0.6.0",
      "org.apache.spark" % "spark-sql_2.12" % "2.4.4")
    .inProject
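
For context, this rule is meant to be applied through sbt-assembly; a minimal sketch of how it could be wired into build.sbt (assuming the sbt-assembly plugin is already enabled):

// build.sbt sketch: apply the shading rule when sbt-assembly builds the application jar.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.spark.sql.execution.datasources.**" -> "local.spark.@1")
    .inLibrary(
      "org.apache.hudi" % "hudi-spark-bundle_2.12" % "0.6.0",
      "org.apache.spark" % "spark-sql_2.12" % "2.4.4")
    .inProject
)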

Alternatively, replace spark-sql_2.12-2.4.4 on the EMR cluster with the official jar:

cd /usr/lib/spark/jars/
sudo mv spark-sql_2.12-2.4.4.jar spark-sql_2.12-2.4.4.jar.1
sudo wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql_2.12/2.4.4/spark-sql_2.12-2.4.4.jar

1 reaction
bvaradar commented, Aug 26, 2020

Regarding the second problem, can you use a Hive 2.x version (e.g. 2.3.3)? This is what Hudi is compiled with.
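
If the project is built with sbt as in the reproduction above, a minimal sketch of pulling in a Hive 2.x hive-exec so those classes resolve (the 2.3.3 version follows the suggestion here; the Provided scope is an assumption):

// build.sbt sketch: add a Hive 2.x hive-exec so classes such as MapredParquetInputFormat resolve.
libraryDependencies += "org.apache.hive" % "hive-exec" % "2.3.3" % Provided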
