DF's behaviour is unexpected when using Excel's stream reading

Hi!

Current Behavior

I am trying to load an .xlsx file into a DataFrame. The file is about 40 MB. Loading the data into the DataFrame works fine, but when I invoke any method on the DataFrame I get no result for a long time (more than an hour), and I have to cancel the job.

Steps to Reproduce (for bugs)

The first step looks like this and completes successfully:
```
val sparkDF = spark.read.format("com.crealytics.spark.excel")
  .option("header", true)
  .option("inferSchema", "true")
  .option("excerptSize", 100)
  .option("maxRowsInMemory", 2)
  .option("sheetName", "OSA")
  .load("/mnt/databricks/TEST/KPI_list_082020.xlsx")
```

The next step is counting rows with sparkDF.count() or displaying them with sparkDF.show(). Regardless of the maxRowsInMemory value (I tried 1 up to 20), the behaviour is the same: I get no results for hours.
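
To put a number on how long the action takes, the step above could be wrapped in a small timing helper. This is only a sketch: `timed` is a hypothetical helper that is not part of Spark or spark-excel, and it assumes `sparkDF` is the DataFrame defined in the previous snippet, in the same notebook session.

```
// Hypothetical helper (not part of Spark or spark-excel):
// runs a block, prints the elapsed wall-clock time, and returns the block's result.
def timed[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block
  val elapsedSec = (System.nanoTime() - start) / 1e9
  println(f"$label took $elapsedSec%.1f s")
  result
}

// Assumes sparkDF was created by the read shown earlier.
// The read itself is lazy, so the file is only scanned in full when an action such as count() runs.
val rowCount = timed("count")(sparkDF.count())
```

Because Spark evaluates lazily, a fast `load` followed by a slow `count()` is consistent with the Excel parsing being the expensive part.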

Your Environment

I am using a Databricks cluster with the following configuration (Apache Spark 2.4.5, Scala 2.11):

Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU

Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU

The file is stored in Azure Blob Storage.

Issue Analytics

State:
Created 3 years ago
Comments: 9

Top GitHub Comments

1 reaction

vasilnikolay commented, Aug 9, 2020

@EnverOsmanov, it now takes about 20 seconds for the 40 MB file (~600k rows). Thanks a lot!

1 reaction

EnverOsmanov commented, Aug 9, 2020

@vasilnikolay, right. Yes, you can. But it is much easier to use 0.13.5. It has been available since my last comment.
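
For readers who build with sbt, picking up that release might look like the line below. This is a sketch that assumes the library is published under the `com.crealytics` group with the artifact name `spark-excel`; check the coordinates against the project's README before relying on them.

```
// build.sbt
// Assumed coordinates: group "com.crealytics", artifact "spark-excel".
// %% appends the project's Scala binary version (for example _2.11) to the artifact name.
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"
```

On a Databricks cluster running Scala 2.11, as in the environment described above, the equivalent Maven coordinate under the same assumption would be `com.crealytics:spark-excel_2.11:0.13.5`, installed as a cluster library.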
