DF's behaviour is unexpected when using Excel's stream reading
Hi!
Current Behavior
I am trying to load an .xlsx file into a DataFrame. The file is about 40 MB. Loading the data into the DF goes fine, but when I invoke any method on the DF it gives me no result for a long time (more than an hour), and I cancel the job.
Steps to Reproduce (for bugs)
The first step looks like this and works successfully:
```
val sparkDF = spark.read.format("com.crealytics.spark.excel")
  .option("header", true)
  .option("inferSchema", "true")
  .option("excerptSize", 100)
  .option("maxRowsInMemory", 2)
  .option("sheetName", "OSA")
  .load("/mnt/databricks/TEST/KPI_list_082020.xlsx")
```
The next step is running an action on the DataFrame, such as sparkDF.count() or sparkDF.show(). Regardless of the maxRowsInMemory value (I tried 1 up to 20), it behaves the same: I get no results for hours.
Your Environment
I am using a Databricks cluster with the following configuration (Apache Spark 2.4.5, Scala 2.11):
- Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU
- Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU
The file is stored in Azure Blob Storage.
Issue Analytics
- State:
- Created 3 years ago
- Comments:9
Top GitHub Comments
@EnverOsmanov, it now finishes in 20 seconds for 40 MB (~600k rows). Thanks a lot!
@vasilnikolay, right. Yes, you can. But it is much easier to use 0.13.5, which has been available since my last comment.
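For anyone landing here later, a minimal sketch of how the 0.13.5 release mentioned above might be pulled into an sbt build. The coordinates are an assumption based on how spark-excel is published on Maven Central; they are not quoted from this thread. On a Databricks cluster running Scala 2.11, the equivalent would presumably be attaching the Maven library `com.crealytics:spark-excel_2.11:0.13.5` to the cluster rather than editing a build file.

```scala
// build.sbt fragment (assumed coordinates, not taken from the thread).
// %% appends the project's Scala binary version (e.g. _2.11) to the artifact name.
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"
```

With the newer version in place, the read code from the issue should not need to change.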