spark-excel doesn't take into account the 'spark.sql.datetime.java8API.enabled' conf added in spark 3


Spark 3 added support for working with java.time.LocalDate/Instant instead of java.sql.Date/Timestamp by setting spark.sql.datetime.java8API.enabled to true. However, spark-excel does not honor this setting when reading data from Excel, which causes errors.
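
To make the two sets of external types concrete, here is a minimal JDK-only sketch of how the legacy types map to their java.time counterparts. The class and method names (ExternalDateTypes, toLocalDate, toInstant) are made up for illustration and are not part of Spark or spark-excel.

```java
import java.time.Instant;
import java.time.LocalDate;

// Hypothetical helper, for illustration only.
final class ExternalDateTypes {

    // Legacy external type for DateType -> Java 8 external type.
    static LocalDate toLocalDate(java.sql.Date legacy) {
        return legacy.toLocalDate();
    }

    // Legacy external type for TimestampType -> Java 8 external type.
    static Instant toInstant(java.sql.Timestamp legacy) {
        return legacy.toInstant();
    }

    public static void main(String[] args) {
        java.sql.Date date = java.sql.Date.valueOf("2021-12-10");
        java.sql.Timestamp ts =
            java.sql.Timestamp.from(Instant.parse("2021-12-10T08:30:00Z"));

        System.out.println(toLocalDate(date));
        System.out.println(toInstant(ts));
    }
}
```

With the flag enabled, Spark expects rows to carry the java.time types, so a data source that still hands over java.sql.Date values would be rejected by the encoder, which matches the exception shown under Current Behavior.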

Expected Behavior

Read dates and timestamps using the new API when the flag is enabled.

Current Behavior

Dates and timestamps are read using the old API. The exception I ran into looks something like this:

java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.sql.Date is not a valid external type for schema of date
.....
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, DateType, localDateToDays, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, Date), DateType), true, false) AS Date#3531
.....
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:213)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:195)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)

Possible Solution

I tracked down my particular issue here; however, I’m not sure whether there are other places in the source code where a fix would be needed. If we can get our hands on the Spark config here, we could return either Date or LocalDate.
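
A rough sketch of what I have in mind, written against the plain JDK so it runs on its own. DateCellConverter, toExternalDate and the java8ApiEnabled parameter are hypothetical names; in the real code the boolean would have to come from the Spark config, and I have not checked how spark-excel would get access to it.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.Date;

// Hypothetical helper, not actual spark-excel code.
final class DateCellConverter {

    // Returns the external type Spark expects for a DateType column:
    // java.time.LocalDate when the Java 8 API flag is on, java.sql.Date otherwise.
    static Object toExternalDate(Date cellValue, boolean java8ApiEnabled, ZoneId zone) {
        // Go through epoch millis so this also works if a java.sql.Date is passed in
        // (java.sql.Date.toInstant() throws UnsupportedOperationException).
        LocalDate localDate =
            Instant.ofEpochMilli(cellValue.getTime()).atZone(zone).toLocalDate();
        return java8ApiEnabled ? localDate : java.sql.Date.valueOf(localDate);
    }
}
```

For example, toExternalDate(cell, true, ZoneId.systemDefault()) would yield a LocalDate, while passing false keeps today's behavior of returning a java.sql.Date. Timestamps would need the same treatment with Instant and java.sql.Timestamp.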

I also noticed there’s a v2.excel package. Is that a new version of spark-excel? Is it ready for use? I couldn’t find anything about it in the docs.

Steps to Reproduce (for bugs)

Read any Excel file with dates or timestamps while spark.sql.datetime.java8API.enabled is set to true in the Spark conf.

Context

Already covered

Your Environment

  • Spark version and language (Scala, Java, Python, R, …): spark 3.1.1 with Java 11
  • Spark-Excel version: com.crealytics:spark-excel_2.12:0.14.0

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

1 reaction
cristichircu commented, Dec 21, 2021

Nice way of putting it 😃 I added the pull request, but I couldn’t figure out where common code between v1 and v2 should reside, so I didn’t get the bonus (yet).

1 reaction
quanghgx commented, Dec 10, 2021

Hi @cristichircu

For the java.sql.Date part:

DateTimeUtils.fromJavaDate(new java.sql.Date(datum.getDateCellValue().getTime))

You are right, we can refactor this part. There is no specific reason for it, just a couple of thoughts:

  • In V2, we prefer reusing Spark’s facilities as much as possible. It just so happened that DateTimeUtils has a method that can be used.
  • For this particular case, we can just divide the cell’s getTime value by a certain constant and return the result.
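
As a sketch of the second point only: Spark stores DateType values internally as a number of days since 1970-01-01, so the constant would be the number of milliseconds in a day. EpochDays and fromEpochMillis are hypothetical names, and the plain division below treats the millisecond value as UTC, so a time zone adjustment would likely still be needed for cells whose value represents local midnight.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only.
final class EpochDays {

    // Milliseconds in one day: the "certain constant" to divide by.
    static final long MILLIS_PER_DAY = TimeUnit.DAYS.toMillis(1);

    // Days since 1970-01-01 (UTC) for an epoch-millisecond value.
    // floorDiv keeps instants before the epoch on the correct (earlier) day.
    static int fromEpochMillis(long millis) {
        return Math.toIntExact(Math.floorDiv(millis, MILLIS_PER_DAY));
    }
}
```

With this, the conversion would look like EpochDays.fromEpochMillis(datum.getDateCellValue().getTime()), without creating an intermediate java.sql.Date.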

About the java8API, let me do some checking and get back to you. Sincerely,
