"org.apache.poi.ss.formula.FormulaParseException: Unused input" when saving dataframe
Expected Behavior
The DataFrame should be saved to S3.
Current Behavior
Saving the DataFrame to S3 fails with the following exception:
```
[2019-07-30 10:09:17,405] ERROR .jobserver.JobManagerActor [] [akka://JobServer/user/context-supervisor/session-context] - Got Throwable
org.apache.poi.ss.formula.FormulaParseException: Unused input [MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470] after attempting to parse the formula [ 5 MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470]
    at org.apache.poi.ss.formula.FormulaParser.parse(FormulaParser.java:2033)
    at org.apache.poi.ss.formula.FormulaParser.parse(FormulaParser.java:170)
    at org.apache.poi.xssf.usermodel.XSSFCell.setFormula(XSSFCell.java:566)
    at org.apache.poi.xssf.usermodel.XSSFCell.setCellFormula(XSSFCell.java:544)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertCell(Model2XlsxConversions.scala:68)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertRow$4.apply(Model2XlsxConversions.scala:251)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertRow$4.apply(Model2XlsxConversions.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertRow(Model2XlsxConversions.scala:251)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertSheet$1.apply(Model2XlsxConversions.scala:275)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertSheet$1.apply(Model2XlsxConversions.scala:275)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertSheet(Model2XlsxConversions.scala:275)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertWorkbook$4.apply(Model2XlsxConversions.scala:470)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertWorkbook$4.apply(Model2XlsxConversions.scala:470)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertWorkbook(Model2XlsxConversions.scala:470)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$XlsxWorkbook.convertAsXlsx(Model2XlsxConversions.scala:557)
    at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$XlsxSheet.convertAsXlsx(Model2XlsxConversions.scala:534)
    at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:38)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:66)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
    at spark.jobserver.JobManagerActor$$anonfun$getJobFuture$4.apply(JobManagerActor.scala:594)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
Possible Solution
The library probably infers that a cell contains a formula based on the cell's content. If so, making this inference configurable would be a possible solution, so that callers can opt out of formula detection entirely.
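Until such an option exists, one possible workaround is to sanitize string columns before writing. Below is a minimal sketch, assuming the formula inference is triggered by string values with a leading `=` (the exact trigger is not confirmed here); `defuseFormulas` is an illustrative helper, not part of spark-excel:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, lit, when}
import org.apache.spark.sql.types.StringType

// Hypothetical workaround: prefix string values that look like formulas so
// the writer stores them as plain text instead of trying to parse them.
// The leading-'=' predicate is an assumption about the inference trigger.
def defuseFormulas(df: DataFrame): DataFrame =
  df.schema.fields.collect { case f if f.dataType == StringType => f.name }
    .foldLeft(df) { (acc, name) =>
      acc.withColumn(
        name,
        when(col(name).startsWith("="), concat(lit("'"), col(name)))
          .otherwise(col(name))
      )
    }
```

This would be applied as `defuseFormulas(dfToSave)` before the save call shown below.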
Steps to Reproduce (for bugs)
Copy and paste the following value into a CSV file, load it into a Spark DataFrame, and try to save it as an Excel file:

```
5 MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470
```
Code used to save the DataFrame:

```scala
dfToSave
  .coalesce(1)
  .write.format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .mode("overwrite")
  .save(s3path)
```
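For completeness, here is a self-contained reproduction sketch that builds the one-row DataFrame directly instead of going through a CSV; the app name, column name, and output path are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-excel-formula-repro")   // placeholder app name
  .getOrCreate()
import spark.implicits._

// One-row DataFrame containing the offending value from the report.
val dfToSave = Seq(
  "5 MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, " +
    "TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470"
).toDF("description")

dfToSave
  .coalesce(1)
  .write.format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .mode("overwrite")
  .save("s3a://some-bucket/repro.xlsx")   // placeholder path
```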
Context
Users cannot export files from our service.
Your Environment
```scala
scalaVersion := "2.11.11"

sparkVersion := "2.2.0"

libraryDependencies ++= Seq(
  "spark.jobserver"   %% "job-server-api"    % "0.8.0"      % "provided",
  "spark.jobserver"   %% "job-server-extras" % "0.8.0"      % "provided",
  "org.apache.spark"  %% "spark-sql"         % sparkVersion % "provided",
  "com.crealytics"    %% "spark-excel"       % "0.11.0"     % "provided",
  "org.apache.hadoop" %  "hadoop-aws"        % "2.7.3"      % "provided",
  "com.amazonaws"     %  "aws-java-sdk"      % "1.7.4"      % "provided"
)
```
Spark standalone mode, running on AWS inside a Docker image.
Issue Analytics
- Created: 4 years ago
- Comments: 9
Top GitHub Comments
Oh, my bad. I mistagged your ID. I am sorry.
Unfortunately, I can’t invest time in this… If you can come up with a PR, I’d be happy to review it.