
[SUPPORT] Exception on snapshot query on MOR table (Hudi 0.6.0)

See original GitHub issue

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

A clear and concise description of the problem.

To Reproduce

Steps to reproduce the behavior:

  1. Have a table with 100 GB of data and a compaction in progress.
  2. Kill the Spark job.
  3. Try to read the data with a snapshot query (a fuller, hedged sketch follows these steps):

val df = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load("s3://path_to_data/*")

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version: 0.6.0

  • Spark version: 2.4.4

  • Hive version: not in use

  • Hadoop version: 3.2.1

  • Storage (HDFS/S3/GCS…): S3

  • Running on Docker? (yes/no): no

Additional context

Add any other context about the problem here.

Stacktrace

Exception: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:257)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4191
	at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary.decodeToDouble(PlainValuesDictionary.java:208)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToDouble(ParquetDictionary.java:46)
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:460)
	at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
	at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.get(MutableColumnarRow.java:178)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.$anonfun$createRowWithRequiredSchema$1(HoodieMergeOnReadRDD.scala:239)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.$anonfun$createRowWithRequiredSchema$1$adapted(HoodieMergeOnReadRDD.scala:237)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.createRowWithRequiredSchema(HoodieMergeOnReadRDD.scala:237)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.hasNext(HoodieMergeOnReadRDD.scala:197)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:244)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:242)
	... 9 more

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 18 (17 by maintainers)

Top GitHub Comments

1 reaction
adaniline-paytm commented on Jan 22, 2021

I have the same sporadic issue with the standard Spark 2.4.7 distribution and Hudi 0.6:

$ ls -l /opt/spark-2.4.7-bin-without-hadoop/jars/parquet-*
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-column-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-common-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-encoding-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-format-2.4.0.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-hadoop-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-jackson-1.10.1.jar

The only workaround we found is to disable the vectorized Parquet reader (the code path that appears in the Caused by frames above):

      rc.spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
0 reactions
vinothchandar commented on Sep 23, 2021

Closing, since the fix has since landed.

Read more comments on GitHub.

Top Results From Across the Web

Older Releases | Apache Hudi
Starting 0.6.0, snapshot queries are feasible on MOR tables using spark datasource. (experimental feature); In prior versions we only supported ...

Work with a Hudi dataset - Amazon EMR - AWS Documentation
Read from a Hudi dataset: To retrieve data at the present point in time, Hudi performs snapshot queries by default. Following is an...

[GitHub] [hudi] stackfun opened a new issue #2367
Snapshot Query on MOR Table using spark datasource. **Expected behavior** Snapshot query returns successfully **Environment Description** ...

Newest 'apache-hudi' Questions - Stack Overflow
api.TableException: Unsupported query: Merge Into. I am working on a Flink streaming job where I need to upsert data in the Hudi table....

Use Redshift Spectrum to Query Apache HUDI Copy On Write ...
You can read Copy On Write (CoW) tables in Apache Hudi versions 0.5.2, 0.6.0, and 0.7.0. For more information, see Copy On Write...
