
When reading an xlsx file (file size 21.3M), an exception occurred: shadeio.poi.util.RecordFormatException

See original GitHub issue

When reading an xlsx file (file size 21.3M, 5000 rows and 500 columns), an exception occurred:

```
Caused by: shadeio.poi.util.RecordFormatException: Tried to allocate an array of length 131,181,982, but the maximum length for this record type is 100,000,000. If the file is not corrupt, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
	at shadeio.poi.util.IOUtils.throwRFE(IOUtils.java:535)
	at shadeio.poi.util.IOUtils.checkLength(IOUtils.java:212)
	at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:177)
	at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:72)
	at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
	at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
	at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:111)
	at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:127)
	at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
	at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
	at scala.Option.orElse(Option.scala:447)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
	at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:256)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
	at com.znv.sentosa.node.ExcelSourceNode.readData_SparkExcel(ExcelSourceNode.java:290)
	at com.znv.sentosa.node.ExcelSourceNode.readDataset(ExcelSourceNode.java:254)
	at com.znv.sentosa.component.server.node.DatasetReadNode.getDatasetWithoutCache(DatasetReadNode.java:30)
	at com.znv.sentosa.component.server.node.DatasetOutNode.getOutputDataset(DatasetOutNode.java:55)
	... 12 common frames omitted
```

Expected Behavior

The program can read the xlsx file into a DataFrame.

Current Behavior

If the file size is bigger than 14M, an exception occurs. If the file size is smaller than 14M, the program runs normally.

Possible Solution

Setting a higher override value with IOUtils.setByteArrayMaxOverride(Integer.MAX_VALUE), as the exception suggested, caused another exception:

```
Caused by: java.io.IOException: MaxLength (100000000) reached - stream seems to be invalid.
	at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:195)
	at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:72)
	at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
	at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
	at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:111)
	at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:127)
	at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
	at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
	at scala.Option.orElse(Option.scala:447)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
	at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:256)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
	at com.znv.sentosa.node.ExcelSourceNode.readData_SparkExcel(ExcelSourceNode.java:290)
	at com.znv.sentosa.node.ExcelSourceNode.readDataset(ExcelSourceNode.java:254)
	at com.znv.sentosa.component.server.node.DatasetReadNode.getDatasetWithoutCache(DatasetReadNode.java:30)
	at com.znv.sentosa.component.server.node.DatasetOutNode.getOutputDataset(DatasetOutNode.java:55)
	... 12 common frames omitted
```
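For reference, the override suggested by the exception message is a static POI setting that must be applied before the workbook is opened. A minimal sketch, assuming spark-excel's shaded POI classes (the `shadeio` package prefix) are on the classpath; the class and method come from Apache POI's `IOUtils`:

```java
// Sketch: raise POI's byte-array allocation limit before any Excel read.
// spark-excel shades Apache POI under the "shadeio" prefix, so the shaded
// class is used here instead of org.apache.poi.util.IOUtils.
import shadeio.poi.util.IOUtils;

public class PoiLimitConfig {
    public static void raiseByteArrayLimit() {
        // Allow allocations up to ~256 MB instead of the 100,000,000-byte default.
        // The setter is static, so one call per JVM suffices.
        IOUtils.setByteArrayMaxOverride(256_000_000);
    }
}
```

Note that, as the second trace shows, raising this limit alone still hits a separate hard cap inside `IOUtils.toByteArray`, so upgrading POI or avoiding full in-memory buffering of the zip entry may also be required.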

Steps to Reproduce (for bugs)

```java
Dataset<Row> dataSet = spark.read().format("excel")
    .option("header", "true")
    .option("dataAddress", "Sheet1")
    .option("maxRowsInMemory", 2000)
    .load(excel_filePath)
    .toDF();
```

Context

Can spark-excel read an xlsx file larger than 14M?

Your Environment

Include as many relevant details about the environment you experienced the bug in

  • Spark version: 3.0.0 (Java)
  • Spark-Excel version: 3.0.3_0.16.4

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8

Top GitHub Comments

2 reactions
pjfanning commented, Mar 4, 2022

This is related to https://bz.apache.org/bugzilla/show_bug.cgi?id=65639

spark-excel has PRs to upgrade to POI 5.2.1 that will help.

As a workaround you could try

shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(100_000_000)

spark-excel shades the original POI classes, so these config calls need to be adjusted to use the shadeio package instead of org.apache.
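Sketched out, that workaround might look like the following. This is a sketch under the assumption that the shaded `shadeio` classes are on the classpath and that it runs in the driver before any `spark.read().format("excel")` call; both setters exist in Apache POI 5.x:

```java
// Sketch: configure POI's temp-file threshold via the shaded classes that
// spark-excel bundles (shadeio.* instead of org.apache.poi.*).
import shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource;

public class SparkExcelWorkaround {
    public static void configurePoiTempFiles() {
        // Zip entries larger than ~100 MB are spilled to temp files instead of
        // being buffered fully in memory, sidestepping the array-length check.
        ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(100_000_000);
    }
}
```

The design point is that the failure happens while POI copies each zip entry into a byte array; spilling large entries to disk avoids the oversized allocation rather than merely raising its ceiling.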

1 reaction
nigl commented, Jul 14, 2022

@pjfanning Thank you for your fast response and quick PR for this issue. I think some adjustments must be made for the V2 version, too. I added my suggestion as a comment on the PR.

I applied my changes for a local build and tested it. It worked as expected 👍
