spark-excel doesn't take into account the 'spark.sql.datetime.java8API.enabled' conf added in spark 3
Spark 3 added support for working with java.time.LocalDate/Instant instead of java.sql.Date/Timestamp by setting spark.sql.datetime.java8API.enabled to true. However, this is not reflected when reading data from Excel using spark-excel, and that causes issues.
Expected Behavior
Read dates and timestamps using the new API when the flag is enabled.
Current Behavior
Dates and timestamps are read using the old API. The exception I ran into looks something like:
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.sql.Date is not a valid external type for schema of date
.....
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, localDateToDays, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, Date), DateType), true, false) AS Date#3531
.....
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:213)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:195)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
Possible Solution
I tracked down my particular issue here; however, I'm not sure if there are other places in the source code where a fix would be needed. If we can get our hands on the Spark config here, we could return either Date or LocalDate.
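To illustrate the idea, here is a minimal sketch of what such a branch could look like. This is not spark-excel's actual code; the helper name and the plain Boolean parameter are assumptions for illustration (in the real code path the flag would come from the Spark session's SQLConf):

```scala
import java.sql.Date
import java.time.LocalDate

// Hypothetical helper: pick the external date type based on the
// spark.sql.datetime.java8API.enabled setting. Catalyst stores dates
// internally as days since the epoch, so both branches start from that.
def toExternalDate(epochDays: Int, java8ApiEnabled: Boolean): Any =
  if (java8ApiEnabled) LocalDate.ofEpochDay(epochDays.toLong)
  else Date.valueOf(LocalDate.ofEpochDay(epochDays.toLong))
```

With the flag enabled the encoder would then receive a java.time.LocalDate, which matches what DateTimeUtils.localDateToDays expects, avoiding the "java.sql.Date is not a valid external type" error above.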
I also noticed there’s a v2.excel package. Is that a new version of spark-excel? Is it ready for use? I couldn’t find anything about it in the docs.
Steps to Reproduce (for bugs)
Read any Excel file containing dates or timestamps with spark.sql.datetime.java8API.enabled set to true in the Spark conf.
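As a sketch, a reproduction could look like the following. The file path is a placeholder, and the data source name is an assumption (v1 of spark-excel registers as com.crealytics.spark.excel; the v2 package may use a different short name):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Enable the Java 8 time API: Spark now expects
  // java.time.LocalDate/Instant as external date/timestamp types.
  .config("spark.sql.datetime.java8API.enabled", "true")
  .getOrCreate()

// Reading a sheet with a date column fails at encoding time, because
// spark-excel still hands back java.sql.Date values.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .load("/path/to/file-with-dates.xlsx") // placeholder path

df.cache().show() // triggers the "not a valid external type" error
```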
Context
Already covered
Your Environment
- Spark version and language (Scala, Java, Python, R, …): spark 3.1.1 with Java 11
- Spark-Excel version: com.crealytics:spark-excel_2.12:0.14.0
Issue Analytics
- State:
- Created 2 years ago
- Comments:13 (7 by maintainers)
Top GitHub Comments
Nice way of putting it 😃 I added the pull request. I couldn't figure out where common code between v1 and v2 should reside, so I didn't get the bonus (yet).
Hi @cristichircu
For the java.sql.Date part:
You are right, we can refactor this part. There is no specific reason for this, just a couple of thoughts:
About the java8API, let me do some checking and get back to you.
Sincerely,