Error Reading files in Excel Worksheet 97-2003 File - xls format

Hi Team, I am facing an odd error when trying to read an xls file with the spark-excel library.

  1. The source Excel file is in the xls (Excel 97-2003) format.
  2. The file has a single sheet with roughly 200 columns and 40k records.
  3. With only the basic options set, the read succeeds but takes around 40 to 45 minutes.
  4. But with the maxRowsInMemory option added, the read fails with the error shown further down.

Working - Takes 45+ mins to process

df_src = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "0!A1")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("setErrorCellsToFallbackValues","true")
  .option("usePlainNumberFormat","true")
  .option("inferSchema", "false")
  .schema(source_schema)
  .load(source_file_oath + source_file_name)

Not Working with maxRowsInMemory option

df_src = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "0!A1")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("setErrorCellsToFallbackValues","true")
  .option("usePlainNumberFormat","true")
  .option("inferSchema", "false")
  .option("maxRowsInMemory", 100)
  .schema(source_schema)
  .load(source_file_oath + source_file_name)

Error Message:

shadeio.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException: The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)
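
The exception text indicates that, with maxRowsInMemory set, the file is handed to the OOXML (xlsx) part of POI, which cannot parse OLE2 (xls) data. As far as the spark-excel README describes it, this option switches to a streaming reader that fails on xls files. One way to guard against this is to check the file's leading bytes before deciding whether to add the option. The following is only a minimal sketch in Python using the standard library: the helper names `detect_excel_container` and `reader_options` are hypothetical (not part of spark-excel), and it assumes the file is readable through a local filesystem path.

```python
# Leading bytes ("magic numbers") of the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx is a ZIP archive; any ZIP starts like this


def detect_excel_container(path):
    """Return 'ole2', 'ooxml', or 'unknown' based on the file's first bytes.

    Hypothetical helper. 'ooxml' is an approximation: it only confirms
    that the file is a ZIP archive, not that it is a valid workbook.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "ooxml"
    return "unknown"


def reader_options(path, max_rows_in_memory=100):
    """Build reader options, adding maxRowsInMemory only for ZIP-based files.

    Hypothetical helper. The option names are the ones used in this issue.
    """
    options = {
        "dataAddress": "0!A1",
        "header": "true",
        "inferSchema": "false",
    }
    if detect_excel_container(path) == "ooxml":
        # The streaming reader is only attempted for xlsx-style files.
        options["maxRowsInMemory"] = str(max_rows_in_memory)
    return options
```

In a PySpark session the resulting dictionary could be passed along as `spark.read.format("com.crealytics.spark.excel").options(**reader_options(path))`. For an xls file this simply falls back to the slower non-streaming read, so it avoids the exception but does not address the 40+ minute load time.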

Context

Trying to bring down the time taken to read a large Excel file (about 48 MB) from 40+ minutes to something lower.

Your Environment

Databricks environment, LTS runtime

  • Spark version: 3.0.1
  • Language: Scala
  • Spark-Excel version: com.crealytics:spark-excel_2.12:0.13.7
  • Operating System and version, cluster environment, …: Ubuntu - Databricks LTS runtime with sufficient memory and disk space.

#62 looks similar, but it is closed and offers no workaround apart from removing the maxRowsInMemory option.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6

Top GitHub Comments

2 reactions
quanghgx commented, Jun 27, 2021

Hi @rbharathkumar, please share your Excel file (if possible) or use a data generator that can reproduce your performance issue (40+ minutes). Frankly, I don’t know if we can do anything about this, but I will certainly try and report back to you two. Sincerely,

0 reactions
quanghgx commented, Dec 5, 2021

Hi @rbharathkumar

  1. https://github.com/crealytics/spark-excel/pull/421 has been merged, thanks to @pjfanning.
  2. In my local testing, the resource consumption (memory and CPU) of spark-excel (V2) is similar to that of Apache POI. Because spark-excel runs on Spark with lazy loading, when multiple files are loaded and rows are processed in a pipeline, it consumes much less memory than would be needed to hold all the Excel files combined.

I am trying to find some time to post my test results here: https://github.com/crealytics/spark-excel/wiki/Examples:-Resource-Usage-and-How-Big-Spark-Excel-Can-Handle%3F so that other people can verify them.

Going to resolve this ticket. Feel free to reopen it. Thank you so much for your feedback.
