spark-excel doesn't take into account the 'spark.sql.datetime.java8API.enabled' conf added in spark 3
Spark 3 added support for working with java.time.LocalDate/Instant instead of java.sql.Date/Timestamp by setting spark.sql.datetime.java8API.enabled to true. However, this is not reflected when reading data from Excel using spark-excel, and that causes issues.
Expected Behavior
Read dates and timestamps using the new API when the flag is enabled.
Current Behavior
Dates and timestamps are read using the old API. The exception I ran into looks something like:
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.sql.Date is not a valid external type for schema of date
.....
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, localDateToDays, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, Date), DateType), true, false) AS Date#3531
.....
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:213)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:195)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
Possible Solution
I tracked down my particular issue here; however, I'm not sure if there are other places in the source code where a fix would be needed. If we can get our hands on the Spark config here, we could return either Date or LocalDate.
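To illustrate the idea, here is a minimal sketch of what such a branch could look like. This is not spark-excel's actual code; the helper name and the plain Boolean parameter are assumptions for illustration (in the real code path the flag would come from the Spark session's SQLConf):

```scala
import java.sql.Date
import java.time.LocalDate

// Hypothetical helper: pick the external date type based on the
// spark.sql.datetime.java8API.enabled setting. Catalyst stores dates
// internally as days since the epoch, so both branches start from that.
def toExternalDate(epochDays: Int, java8ApiEnabled: Boolean): Any =
  if (java8ApiEnabled) LocalDate.ofEpochDay(epochDays.toLong)
  else Date.valueOf(LocalDate.ofEpochDay(epochDays.toLong))
```

With the flag enabled the encoder would then receive a java.time.LocalDate, which matches what DateTimeUtils.localDateToDays expects, avoiding the "java.sql.Date is not a valid external type" error above.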
I also noticed there’s a v2.excel package. Is that a new version of spark-excel? Is it ready for use? I couldn’t find anything about it in the docs.
Steps to Reproduce (for bugs)
Read any Excel file containing dates or timestamps with spark.sql.datetime.java8API.enabled set to true in the Spark conf.
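As a sketch, a reproduction could look like the following. The file path is a placeholder, and the data source name is an assumption (v1 of spark-excel registers as com.crealytics.spark.excel; the v2 package may use a different short name):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Enable the Java 8 time API: Spark now expects
  // java.time.LocalDate/Instant as external date/timestamp types.
  .config("spark.sql.datetime.java8API.enabled", "true")
  .getOrCreate()

// Reading a sheet with a date column fails at encoding time, because
// spark-excel still hands back java.sql.Date values.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .load("/path/to/file-with-dates.xlsx") // placeholder path

df.cache().show() // triggers the "not a valid external type" error
```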
Context
Already covered
Your Environment
- Spark version and language (Scala, Java, Python, R, …): spark 3.1.1 with Java 11
- Spark-Excel version: com.crealytics:spark-excel_2.12:0.14.0
Issue Analytics
- State:
- Created 2 years ago
- Comments:13 (7 by maintainers)
Top GitHub Comments
Nice way of putting it 😃 I added the pull request. I couldn't figure out where common code between v1 and v2 should reside, so I didn't get the bonus (yet).
Hi @cristichircu
For the java.sql.Date part:
You are right, we can refactor this part. There is no specific reason for this, just a couple of thoughts:
About the java8API, let me do some checking and get back to you.
Sincerely,