Error Reading files in Excel Worksheet 97-2003 File - xls format
Hi Team, I am currently facing an odd error when trying to read an xls file using the spark-excel library.
- The source excel file is an xls - 97-2003 file format.
- The file has only one sheet with around 200 odd columns and 40k records.
- With only the basic options set, the file read is successful but takes around 40 to 45 mins.
- But with the maxRowsInMemory option set, the read fails with the error shown further down.
Working - Takes 45+ mins to process
val df_src = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "0!A1")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("setErrorCellsToFallbackValues", "true")
  .option("usePlainNumberFormat", "true")
  .option("inferSchema", "false")
  .schema(source_schema)
  .load(source_file_path + source_file_name)
Not Working with maxRowsInMemory option
val df_src = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "0!A1")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("setErrorCellsToFallbackValues", "true")
  .option("usePlainNumberFormat", "true")
  .option("inferSchema", "false")
  .option("maxRowsInMemory", 100)
  .schema(source_schema)
  .load(source_file_path + source_file_name)
Error Message:
shadeio.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException: The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)
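The exception distinguishes two container formats: OLE2 (the binary .xls format of Excel 97-2003) and OOXML (the zip-based .xlsx format). A minimal sketch of telling them apart by their leading bytes is given here, in case it helps confirm which format a given file really is. The object and method names (ExcelFormatSniffer, detect) are hypothetical and not part of spark-excel or POI; only the two well-known file signatures are relied on.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.util.Arrays

// Hypothetical helper, not part of spark-excel or Apache POI.
object ExcelFormatSniffer {
  sealed trait ExcelFormat
  case object Ole2 extends ExcelFormat    // binary .xls (Excel 97-2003)
  case object Ooxml extends ExcelFormat   // zip-based .xlsx
  case object Unknown extends ExcelFormat

  // OLE2 compound document signature (8 bytes).
  private val Ole2Magic: Array[Byte] =
    Array(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1).map(_.toByte)

  // Zip local file header signature "PK\u0003\u0004" (4 bytes).
  private val ZipMagic: Array[Byte] =
    Array(0x50, 0x4B, 0x03, 0x04).map(_.toByte)

  /** Reads at most 8 bytes from `in` and classifies the container format. */
  def detect(in: InputStream): ExcelFormat = {
    val header = new Array[Byte](8)
    var filled = 0
    var eof = false
    // A single read may return fewer bytes than requested, so loop until full or EOF.
    while (!eof && filled < header.length) {
      val n = in.read(header, filled, header.length - filled)
      if (n < 0) eof = true else filled += n
    }
    if (filled == 8 && Arrays.equals(header, Ole2Magic)) Ole2
    else if (filled >= 4 && Arrays.equals(header.take(4), ZipMagic)) Ooxml
    else Unknown
  }
}
```

The stream can come from any source (local file, DBFS, object store). Note that a zip signature only shows the file is a zip archive, which is a necessary but not sufficient condition for it being a valid .xlsx workbook.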
Context
Trying to bring down the time taken to read a large Excel file (about 48 MB) from 40+ minutes to something lower.
Your Environment
Databricks Environment - LTS Run time
- Spark version: 3.0.1
- Language: Scala
- Spark-Excel version: com.crealytics:spark-excel_2.12:0.13.7
- Operating System and version, cluster environment, …: Ubuntu - Databricks LTS runtime with sufficient memory and disk space.
#62 looks similar, but it is closed and offers no workaround apart from removing the maxRowsInMemory option.
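For pipelines that receive both .xls and .xlsx files, the workaround of dropping maxRowsInMemory can be applied per file. The sketch below builds the option map from the read call above and adds maxRowsInMemory only for .xlsx names. It assumes, as the exception suggests, that the streaming reader behind maxRowsInMemory handles OOXML files only. ExcelReadOptions and forFile are hypothetical names, and the extension check is a simplification that trusts the file name.

```scala
import java.util.Locale

// Hypothetical helper, not part of spark-excel.
object ExcelReadOptions {
  // Options copied from the read call in this issue.
  val baseOptions: Map[String, String] = Map(
    "dataAddress" -> "0!A1",
    "header" -> "true",
    "treatEmptyValuesAsNulls" -> "true",
    "setErrorCellsToFallbackValues" -> "true",
    "usePlainNumberFormat" -> "true",
    "inferSchema" -> "false"
  )

  /** Adds maxRowsInMemory only when the file name ends in .xlsx. */
  def forFile(fileName: String, streamingRows: Int = 100): Map[String, String] =
    if (fileName.toLowerCase(Locale.ROOT).endsWith(".xlsx"))
      baseOptions + ("maxRowsInMemory" -> streamingRows.toString)
    else
      baseOptions // .xls (OLE2): leave the streaming option out
}

// Usage with Spark (not executed here; `spark`, `source_schema` and the
// path variables are the ones from the issue):
//   spark.read
//     .format("com.crealytics.spark.excel")
//     .options(ExcelReadOptions.forFile(source_file_name))
//     .schema(source_schema)
//     .load(source_file_path + source_file_name)
```

This only avoids the exception; it does not make the .xls read itself any faster.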
Issue Analytics
- State:
- Created 2 years ago
- Comments:6
Top GitHub Comments
Hi @rbharathkumar, please share your Excel file (if possible) or use a data generator that can reproduce your performance issue (40+ min). Frankly, I don't know if we can do anything about this, but I will certainly try and report back to you two. Sincerely,
Hi @rbharathkumar
I am finding some time to put my test results here: https://github.com/crealytics/spark-excel/wiki/Examples:-Resource-Usage-and-How-Big-Spark-Excel-Can-Handle%3F so other people can verify them.
Going to resolve this ticket. Feel free to reopen it. Thank you so much for your feedback.