
[SUPPORT] Exception on snapshot query on MOR table (Hudi 0.6.0)

See original GitHub issue

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

A clear and concise description of the problem.

To Reproduce

Steps to reproduce the behavior:

  1. Have a table with 100 GB of data and a compaction in progress.
  2. Kill the Spark job.
  3. Try to read the data with a snapshot query (a fuller, hedged sketch follows these steps):

val df = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load("s3://path_to_data/*")

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version: 0.6.0

  • Spark version: 2.4.4

  • Hive version: not in use

  • Hadoop version: 3.2.1

  • Storage (HDFS/S3/GCS…): S3

  • Running on Docker? (yes/no): no

Additional context

Add any other context about the problem here.

Stacktrace

Exception: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:257)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4191
	at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary.decodeToDouble(PlainValuesDictionary.java:208)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToDouble(ParquetDictionary.java:46)
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:460)
	at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
	at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.get(MutableColumnarRow.java:178)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.$anonfun$createRowWithRequiredSchema$1(HoodieMergeOnReadRDD.scala:239)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.$anonfun$createRowWithRequiredSchema$1$adapted(HoodieMergeOnReadRDD.scala:237)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.createRowWithRequiredSchema(HoodieMergeOnReadRDD.scala:237)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.hasNext(HoodieMergeOnReadRDD.scala:197)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:244)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:242)
	... 9 more

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 18 (17 by maintainers)

Top GitHub Comments

1 reaction
adaniline-paytm commented on Jan 22, 2021

I have the same sporadic issue with the standard Spark 2.4.7 distribution and Hudi 0.6:

$ ls -l /opt/spark-2.4.7-bin-without-hadoop/jars/parquet-*
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-column-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-common-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-encoding-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-format-2.4.0.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-hadoop-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-jackson-1.10.1.jar

The only workaround we found is to disable the vectorized Parquet reader (the code path that appears in the Caused by frames above):

      rc.spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
0 reactions
vinothchandar commented on Sep 23, 2021

Closing, since the fix has since landed.

Read more comments on GitHub.

Top Results From Across the Web

Older Releases | Apache Hudi
Starting 0.6.0, snapshot queries are feasible on MOR tables using spark datasource. (experimental feature); In prior versions we only supported ...

Work with a Hudi dataset - Amazon EMR - AWS Documentation
Read from a Hudi dataset: To retrieve data at the present point in time, Hudi performs snapshot queries by default. Following is an...

[GitHub] [hudi] stackfun opened a new issue #2367
Snapshot Query on MOR Table using spark datasource. **Expected behavior** Snapshot query returns successfully **Environment Description** ...

Newest 'apache-hudi' Questions - Stack Overflow
api.TableException: Unsupported query: Merge Into. I am working on a Flink streaming job where I need to upsert data in the Hudi table....

Use Redshift Spectrum to Query Apache HUDI Copy On Write ...
You can read Copy On Write (CoW) tables in Apache Hudi versions 0.5.2, 0.6.0, and 0.7.0. For more information, see Copy On Write...
