When reading a 21.3 MB xlsx file, an exception occurred: shadeio.poi.util.RecordFormatException
When reading an xlsx file of 21.3 MB with 5000 rows and 500 columns, an exception occurred:

Caused by: shadeio.poi.util.RecordFormatException: Tried to allocate an array of length 131,181,982, but the maximum length for this record type is 100,000,000. If the file is not corrupt, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
    at shadeio.poi.util.IOUtils.throwRFE(IOUtils.java:535)
    at shadeio.poi.util.IOUtils.checkLength(IOUtils.java:212)
    at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:177)
    at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:72)
    at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
    at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
    at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
    at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
    at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
    at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:111)
    at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:127)
    at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
    at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
    at scala.Option.orElse(Option.scala:447)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
    at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:256)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
    at com.znv.sentosa.node.ExcelSourceNode.readData_SparkExcel(ExcelSourceNode.java:290)
    at com.znv.sentosa.node.ExcelSourceNode.readDataset(ExcelSourceNode.java:254)
    at com.znv.sentosa.component.server.node.DatasetReadNode.getDatasetWithoutCache(DatasetReadNode.java:30)
    at com.znv.sentosa.component.server.node.DatasetOutNode.getOutputDataset(DatasetOutNode.java:55)
    ... 12 common frames omitted
Expected Behavior
The program should read the xlsx file into a DataFrame.
Current Behavior
If the file is larger than 14 MB, an exception occurs; if it is smaller than 14 MB, the program runs normally.
Possible Solution
Setting a higher override value with IOUtils.setByteArrayMaxOverride(Integer.MAX_VALUE), as the exception message suggests, leads to another exception:

Caused by: java.io.IOException: MaxLength (100000000) reached - stream seems to be invalid.
    at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:195)
    at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:72)
    at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
    at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
    at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
    at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
    at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
    at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:111)
    at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:127)
    at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
    at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
    at scala.Option.orElse(Option.scala:447)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
    at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
    at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:256)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
    at com.znv.sentosa.node.ExcelSourceNode.readData_SparkExcel(ExcelSourceNode.java:290)
    at com.znv.sentosa.node.ExcelSourceNode.readDataset(ExcelSourceNode.java:254)
    at com.znv.sentosa.component.server.node.DatasetReadNode.getDatasetWithoutCache(DatasetReadNode.java:30)
    at com.znv.sentosa.component.server.node.DatasetOutNode.getOutputDataset(DatasetOutNode.java:55)
    ... 12 common frames omitted
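For reference, a minimal sketch of the override that was attempted, assuming it runs on the driver before the Spark read is issued. Since spark-excel relocates Apache POI under the shadeio package, the shaded IOUtils class is used rather than org.apache.poi.util.IOUtils:

```
// Minimal sketch of the attempted override (assumption: executed on the
// driver before spark.read() is called). spark-excel ships a shaded copy
// of Apache POI, so the relocated shadeio class is used here.
import shadeio.poi.util.IOUtils;

public class PoiOverrideSketch {
    public static void main(String[] args) {
        // Raise POI's maximum byte-array allocation, as the first exception
        // message suggests. On the POI version shaded into spark-excel
        // 0.16.4 this only surfaces the second exception shown above.
        IOUtils.setByteArrayMaxOverride(Integer.MAX_VALUE);
    }
}
```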
Steps to Reproduce (for bugs)
```
Dataset<Row> dataSet = spark.read().format("excel")
    .option("header", "true")
    .option("dataAddress", "Sheet1")
    .option("maxRowsInMemory", 2000)
    .load(excel_filePath)
    .toDF();
```
Context
Can spark-excel read an xlsx file larger than 14 MB?
Your Environment
- Spark version: 3.0.0 (Java)
- Spark-Excel version: 3.0.3_0.16.4
Top GitHub Comments
This is related to https://bz.apache.org/bugzilla/show_bug.cgi?id=65639
spark-excel has PRs to upgrade to POI 5.2.1 that will help.
As a workaround you could try adjusting POI's limits, as sketched below. Note that spark-excel shades the original POI classes, so these config calls need to be adjusted to use the shadeio package instead of org.apache.
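A hedged sketch of what those adjusted calls could look like; the numeric limits are illustrative assumptions rather than values from the original comment, and the temp-file threshold call requires a build on POI 5.2.1+ as mentioned above:

```
// Sketch of the suggested workaround using the shadeio-relocated package
// names from spark-excel's shaded POI. The numeric limits below are
// illustrative assumptions.
import shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource;
import shadeio.poi.openxml4j.util.ZipSecureFile;
import shadeio.poi.util.IOUtils;

public class SparkExcelPoiConfig {
    public static void relaxPoiLimits() {
        // Allow larger byte-array allocations than POI's 100,000,000 default.
        IOUtils.setByteArrayMaxOverride(250_000_000);
        // Raise the zip-bomb guard's maximum uncompressed entry size.
        ZipSecureFile.setMaxEntrySize(250_000_000L);
        // With POI 5.2.1+, spill large zip entries to temp files instead of
        // in-memory byte arrays (the change tracked in the Bugzilla issue).
        ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(16_000_000);
    }
}
```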
@pjfanning Thank you for your fast response and the quick PR for this issue. I think some adjustments need to be made for the V2 version, too; I added my suggestion as a comment on the PR.
I applied my changes in a local build and tested it. It worked as expected 👍