DataFrame behaves unexpectedly when using Excel streaming reads

Hi!

Current Behavior

I am trying to load an .xlsx file into a DataFrame. The file is about 40 MB. Loading the data into the DataFrame works fine, but when I invoke any method on the DataFrame it returns no result for a long time (more than an hour), and I end up cancelling the job.

Steps to Reproduce (for bugs)

The first step looks like this and completes successfully:
```
val sparkDF = spark.read.format("com.crealytics.spark.excel")
  .option("header", true)
  .option("inferSchema", "true")
  .option("excerptSize", 100)
  .option("maxRowsInMemory", 2)
  .option("sheetName", "OSA")
  .load("/mnt/databricks/TEST/KPI_list_082020.xlsx")
```

The next step is counting or displaying the rows: sparkDF.count() or sparkDF.show(). Regardless of the maxRowsInMemory value (I tried 1 up to 20), the behaviour is the same: I get no results for hours.

Your Environment

I am using a Databricks cluster (Apache Spark 2.4.5, Scala 2.11) with the following configuration:

  • Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU
  • Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU

The file is stored in Azure Blob Storage.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
vasilnikolay commented, Aug 9, 2020

@EnverOsmanov, it runs in 20 seconds for 40 MB (~600k rows). Thanks a lot!

1 reaction
EnverOsmanov commented, Aug 9, 2020

@vasilnikolay, right. Yes, you can. But it is much easier to use 0.13.5. It has been available since my last comment.
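
As a rough sketch of how the suggested upgrade could be checked, the snippet below repeats the read from the question and times the `count()` action. It assumes spark-excel 0.13.5 is already attached to the cluster. The timing code and the `maxRowsInMemory` value of 20 are illustrative choices, not something taken from the thread, and all other options and the path are copied from the question as-is.

```scala
import org.apache.spark.sql.SparkSession

// In a Databricks notebook a session already exists; getOrCreate() returns it.
val spark = SparkSession.builder().getOrCreate()

// Same read as in the question. maxRowsInMemory = 20 is an illustrative value.
val sparkDF = spark.read.format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("excerptSize", 100)
  .option("maxRowsInMemory", 20)
  .option("sheetName", "OSA")
  .load("/mnt/databricks/TEST/KPI_list_082020.xlsx")

// count() is an action, so this is where the whole sheet is actually read.
val start = System.nanoTime()
val rows = sparkDF.count()
val seconds = (System.nanoTime() - start) / 1e9
println(f"count = $rows in $seconds%.1f s")
```

Timing the action rather than the `load` call matters here because, as described in the question, `load` returns quickly and the delay only shows up once a method such as `count()` is invoked.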
