
[BUG] NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser when reading a folder

See original GitHub issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Hello,

I am trying to use the latest version of the library on a Databricks cluster to read a whole folder of Excel files. Here is the stack trace:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 27) (10.131.243.108 executor 0): java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser.<init>(Lscala/Function1;Lorg/apache/spark/sql/catalyst/util/ParseMode;Lorg/apache/spark/sql/types/StructType;Ljava/lang/String;)V at com.crealytics.spark.v2.excel.ExcelParser$.parseIterator(ExcelParser.scala:423) at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.readFile(ExcelPartitionReaderFactory.scala:75) at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.buildReader(ExcelPartitionReaderFactory.scala:61) at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createReader$1(FilePartitionReaderFactory.scala:30) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:99) at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:43) at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:94) at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:131) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80) at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156) at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.Task.run(Task.scala:95) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:826) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1670) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:829) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:684) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2984) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2931) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2925) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2925) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1345) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1345) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1345) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3193) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3134) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3122) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1107) at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2628) at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:266) at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:276) at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:81) at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:87) at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:75) at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:62) at org.apache.spark.sql.execution.ResultCacheManager.collectResult$1(ResultCacheManager.scala:587) at org.apache.spark.sql.execution.ResultCacheManager.computeResult(ResultCacheManager.scala:596) at org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:542) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:541) at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:438) at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:417) at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:422) at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3132) at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3123) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3930) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:209) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:356) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:160) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:958) at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:115) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:306) at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:3928) at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3122) at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:268) at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:102) at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$getResultBufferInternal$3(ScalaDriverLocal.scala:345) at scala.Option.map(Option.scala:230) at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$getResultBufferInternal$1(ScalaDriverLocal.scala:325) at scala.Option.map(Option.scala:230) at com.databricks.backend.daemon.driver.ScalaDriverLocal.getResultBufferInternal(ScalaDriverLocal.scala:289) at com.databricks.backend.daemon.driver.DriverLocal.getResultBuffer(DriverLocal.scala:715) at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:267) at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:602) at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:28) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94) at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:26) at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:205) at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:204) at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:60) at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:240) at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:225) at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:60) at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:579) at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615) at scala.util.Try$.apply(Try.scala:213) at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607) at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526) at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561) at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431) at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374) at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser.<init>(Lscala/Function1;Lorg/apache/spark/sql/catalyst/util/ParseMode;Lorg/apache/spark/sql/types/StructType;Ljava/lang/String;)V at com.crealytics.spark.v2.excel.ExcelParser$.parseIterator(ExcelParser.scala:423) at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.readFile(ExcelPartitionReaderFactory.scala:75) at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.buildReader(ExcelPartitionReaderFactory.scala:61) at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createReader$1(FilePartitionReaderFactory.scala:30) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at 
org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:99) at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:43) at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:94) at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:131) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80) at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55) at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156) at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.scheduler.Task.run(Task.scala:95) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:826) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1670) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:829) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:684) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

A NoSuchMethodError usually points to a compatibility issue, but I have tried several different configurations and always hit the same problem.
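
As a sanity check on the classpath side, here is one way to see which jar FailureSafeParser is actually loaded from on the cluster (a minimal sketch using only standard JVM reflection, run in a %scala cell):

%scala
// Print the location FailureSafeParser is loaded from; on Databricks this should point at
// the runtime's own spark-catalyst build rather than the open-source artifact.
// getCodeSource can be null for boot-classpath classes, hence the Option wrapper.
val location = Option(
  classOf[org.apache.spark.sql.catalyst.util.FailureSafeParser[_]]
    .getProtectionDomain.getCodeSource
).map(_.getLocation.toString).getOrElse("<boot classpath>")
println(location)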

I want to use the library from PySpark, but I have tried it in Scala too and get the same issue.

Reading works fine when I use "com.crealytics.spark.excel" for just a single file.
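
For comparison, this is the kind of single-file read that does work for me with the V1 data source (a minimal sketch; the file path is a placeholder, and the options mirror the ones from the reproduction below):

%scala
// Works: V1 data source, pointed at a single Excel file rather than a folder.
val dfSingle = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("s3://bucket/prefix/one-file.xlsx")

display(dfSingle)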

Thanks a lot.

Expected Behavior

No response

Steps To Reproduce

%scala
val df = spark.read
  .format("excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("s3://bucket/prefix/")

display(df)

%python
df = (spark.read
  .format("excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("s3://bucket/prefix/")
)

display(df)

Environment

- Spark version: 3.2.1
- Spark-Excel version: 3.2.1_0.17.0 (Scala 2.12)
- OS: Databricks cluster
- Cluster environment: 10.4 LTS Photon

Anything else?

No response

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 16 (3 by maintainers)

Top GitHub Comments

2 reactions
alexjbush commented, May 3, 2022

This looks identical to the issue I get running on Azure Databricks 10.4, described here: https://github.com/crealytics/spark-excel/issues/467#issuecomment-1087075025

Out of interest, I pulled the FailureSafeParser constructor details on Databricks to compare against the Apache Spark definition:

import scala.reflect.runtime.universe._

def getConstructorParams[T: TypeTag] =
    typeOf[T].decl(termNames.CONSTRUCTOR)
             .alternatives.head.asMethod
             .paramLists.head.map(p => s"${p.asTerm.name.toString}: ${p.asTerm.info.toString} # isParamWithDefault=${p.asTerm.isParamWithDefault}")

getConstructorParams[org.apache.spark.sql.catalyst.util.FailureSafeParser[Vector[String]]]

Running this on Databricks gives the result:

import scala.reflect.runtime.universe._
getConstructorParams: [T](implicit evidence$1: reflect.runtime.universe.TypeTag[T])List[String]
res0: List[String] = List(rawParser: IN => Iterable[org.apache.spark.sql.catalyst.InternalRow] # isParamWithDefault=false, mode: org.apache.spark.sql.catalyst.util.ParseMode # isParamWithDefault=false, schema: org.apache.spark.sql.types.StructType # isParamWithDefault=false, columnNameOfCorruptRecord: String # isParamWithDefault=false, filePath: Option[String] # isParamWithDefault=false, debugWriter: Option[com.databricks.sql.catalyst.BadRecordsWriter] # isParamWithDefault=false, columnNameOfRescuedDataOpt: Option[String] # isParamWithDefault=true)

This gives a constructor definition of:

class FailureSafeParser[IN](
    rawParser: IN => Iterable[InternalRow],
    mode: ParseMode,
    schema: StructType,
    columnNameOfCorruptRecord: String,
    filePath: Option[String],
    debugWriter: Option[com.databricks.sql.catalyst.BadRecordsWriter],
    columnNameOfRescuedDataOpt: Option[String] = ?)

Compare that to what the library expects, i.e. the open-source Apache Spark definition: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala

class FailureSafeParser[IN](
    rawParser: IN => Iterable[InternalRow],
    mode: ParseMode,
    schema: StructType,
    columnNameOfCorruptRecord: String)

They seem to have added two non-optional arguments and one optional argument to the constructor of FailureSafeParser.

I guess there are a few options here, but I have no idea how realistic they are:

  • Use reflection to instantiate an instance of the class and fill the correct constructor (yuk); a rough sketch of this follows after the list
  • Build against the Databricks Spark jars (difficult to source and automate)
  • Use an alternative approach to parsing (is this even possible?)
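
To make the first option concrete, here is a rough sketch of what picking the constructor via reflection could look like (purely illustrative: FailureSafeParserCompat is a hypothetical helper, not spark-excel code, and it assumes the extra Databricks-only parameters are all Option-typed, as observed above):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.{FailureSafeParser, ParseMode}
import org.apache.spark.sql.types.StructType

object FailureSafeParserCompat {
  def create[IN](
      rawParser: IN => Iterable[InternalRow],
      mode: ParseMode,
      schema: StructType,
      columnNameOfCorruptRecord: String): FailureSafeParser[IN] = {
    // FailureSafeParser has a single primary constructor; only its arity differs between
    // open-source Spark (4 parameters) and the Databricks runtime (7, per the output above).
    val ctor = classOf[FailureSafeParser[_]].getConstructors.head
    val baseArgs = Seq[AnyRef](rawParser, mode, schema, columnNameOfCorruptRecord)
    // Fill any extra (Databricks-only) parameters with None; they are all Option-typed in
    // the observed signature, so None should be acceptable for each of them.
    val args = baseArgs ++ Seq.fill(ctor.getParameterCount - baseArgs.length)(None)
    ctor.newInstance(args: _*).asInstanceOf[FailureSafeParser[IN]]
  }
}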

Any comment on how feasible these options are, and if there are any other alternatives?

1 reaction
nightscape commented, May 5, 2022

Hey @sevann71, great work and follow-up! I'm wondering if the Databricks folks could do something like the Scala compiler team does and run the unit tests of "popular" (😁) Spark plugins on their platform, or maybe use MiMa as well.
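
For what it's worth, here is a minimal sketch of how MiMa is typically wired into an sbt build (the artifact coordinates are placeholders for whatever module would actually be checked):

// project/plugins.sbt
addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "1.1.0")

// build.sbt: declare the previously published artifact(s) to diff the current build against
mimaPreviousArtifacts := Set("org.apache.spark" %% "spark-catalyst" % "3.2.1")

// Then run: sbt mimaReportBinaryIssues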


Top Results From Across the Web

Spark 3.1.2 NoSuchMethodError: org.apache.spark.sql ...
I'm using spark-alchemy below version and calling a function named hll_init_agg inside .agg and getting above error. CODE where it's called:
crealytics - Bountysource
I try to read an excel file size 35MB and write the result as orc files ... [BUG] NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser ...
Apache Avro Data Source Guide - Spark 3.3.1 Documentation
To load/save data in Avro format, you need to specify the data source option format as avro (or org.apache.spark.sql.avro ). Scala; Java; Python;...
How to read excel file using databricks
I'm getting error java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution ...
Solved: User class threw exception: org.apache.spark.sql.A...
lang.RuntimeException: java.io.IOException: Unable to create directory /tmp/hive/. Labels:.
