
When reading an xlsx file (file size 21.3M), an exception occurred: shadeio.poi.util.RecordFormatException

See original GitHub issue

When reading an xlsx file (file size 21.3M, 5000 rows and 500 columns), an exception occurred:

```
Caused by: shadeio.poi.util.RecordFormatException: Tried to allocate an array of length 131,181,982, but the maximum length for this record type is 100,000,000. If the file is not corrupt, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
	at shadeio.poi.util.IOUtils.throwRFE(IOUtils.java:535)
	at shadeio.poi.util.IOUtils.checkLength(IOUtils.java:212)
	at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:177)
	at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:72)
	at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
	at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
	at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:111)
	at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:127)
	at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
	at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
	at scala.Option.orElse(Option.scala:447)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
	at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:256)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
	at com.znv.sentosa.node.ExcelSourceNode.readData_SparkExcel(ExcelSourceNode.java:290)
	at com.znv.sentosa.node.ExcelSourceNode.readDataset(ExcelSourceNode.java:254)
	at com.znv.sentosa.component.server.node.DatasetReadNode.getDatasetWithoutCache(DatasetReadNode.java:30)
	at com.znv.sentosa.component.server.node.DatasetOutNode.getOutputDataset(DatasetOutNode.java:55)
	... 12 common frames omitted
```

Expected Behavior

The program can read the xlsx file into a DataFrame.

Current Behavior

If the file size is bigger than 14M, an exception occurs. If the file size is smaller than 14M, the program runs normally.

Possible Solution

Setting a higher override value with IOUtils.setByteArrayMaxOverride(Integer.MAX_VALUE), as the exception suggested, caused another exception:

```
Caused by: java.io.IOException: MaxLength (100000000) reached - stream seems to be invalid.
	at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:195)
	at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:72)
	at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
	at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
	at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:111)
	at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:127)
	at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:69)
	at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:42)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
	at scala.Option.orElse(Option.scala:447)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
	at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:256)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
	at com.znv.sentosa.node.ExcelSourceNode.readData_SparkExcel(ExcelSourceNode.java:290)
	at com.znv.sentosa.node.ExcelSourceNode.readDataset(ExcelSourceNode.java:254)
	at com.znv.sentosa.component.server.node.DatasetReadNode.getDatasetWithoutCache(DatasetReadNode.java:30)
	at com.znv.sentosa.component.server.node.DatasetOutNode.getOutputDataset(DatasetOutNode.java:55)
	... 12 common frames omitted
```
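For reference, the override suggested by the exception message is a static POI setting that must be applied before the workbook is opened. A minimal sketch, assuming spark-excel's shaded POI classes (the `shadeio` package prefix) are on the classpath; the class and method come from Apache POI's `IOUtils`:

```java
// Sketch: raise POI's byte-array allocation limit before any Excel read.
// spark-excel shades Apache POI under the "shadeio" prefix, so the shaded
// class is used here instead of org.apache.poi.util.IOUtils.
import shadeio.poi.util.IOUtils;

public class PoiLimitConfig {
    public static void raiseByteArrayLimit() {
        // Allow allocations up to ~256 MB instead of the 100,000,000-byte default.
        // The setter is static, so one call per JVM suffices.
        IOUtils.setByteArrayMaxOverride(256_000_000);
    }
}
```

Note that, as the second trace shows, raising this limit alone still hits a separate hard cap inside `IOUtils.toByteArray`, so upgrading POI or avoiding full in-memory buffering of the zip entry may also be required.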

Steps to Reproduce (for bugs)

```java
Dataset<Row> dataSet = spark.read().format("excel")
    .option("header", "true")
    .option("dataAddress", "Sheet1")
    .option("maxRowsInMemory", 2000)
    .load(excel_filePath)
    .toDF();
```

Context

Can spark-excel read an xlsx file larger than 14M?

Your Environment

Include as many relevant details about the environment you experienced the bug in

  • Spark version: 3.0.0 (Java)
  • Spark-Excel version: 3.0.3_0.16.4

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8

Top GitHub Comments

2 reactions
pjfanning commented, Mar 4, 2022

This is related to https://bz.apache.org/bugzilla/show_bug.cgi?id=65639

spark-excel has PRs to upgrade to POI 5.2.1 that will help.

As a workaround you could try

shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(100_000_000)

spark-excel shades the original POI classes, so these config calls need to be adjusted to use the shadeio package instead of org.apache.
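Sketched out, that workaround might look like the following. This is a sketch under the assumption that the shaded `shadeio` classes are on the classpath and that it runs in the driver before any `spark.read().format("excel")` call; both setters exist in Apache POI 5.x:

```java
// Sketch: configure POI's temp-file threshold via the shaded classes that
// spark-excel bundles (shadeio.* instead of org.apache.poi.*).
import shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource;

public class SparkExcelWorkaround {
    public static void configurePoiTempFiles() {
        // Zip entries larger than ~100 MB are spilled to temp files instead of
        // being buffered fully in memory, sidestepping the array-length check.
        ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(100_000_000);
    }
}
```

The design point is that the failure happens while POI copies each zip entry into a byte array; spilling large entries to disk avoids the oversized allocation rather than merely raising its ceiling.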

1 reaction
nigl commented, Jul 14, 2022

@pjfanning Thank you for your fast response and quick PR for this issue. I think some adjustments must be made for the V2 version, too. I added my suggestion as a comment on the PR.

I applied my changes for a local build and tested it. It worked as expected 👍
