PR Proposal: dataframe.write.format("excel") supports SaveMode.Overwrite/Append and partitioning

Expected Behavior

dataframe.write.format("excel") supports SaveMode.Overwrite/Append and partitioning in the same way as other data sources such as CSV.
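
For illustration, the expected usage could look like the sketch below. It relies only on the standard DataFrameWriter calls (format, mode, partitionBy, save). The column names, the sample rows, the output path, and the local SparkSession are assumptions made for this example, and it needs spark-excel on the classpath to resolve the "excel" format.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object ExcelWriteSketch {
  def main(args: Array[String]): Unit = {
    // Local session for demonstration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("excel-write-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; "year" is used as the partition column.
    val df = Seq(
      (2021, "a", 3.0),
      (2022, "a", 1.0),
      (2022, "b", 2.0)
    ).toDF("year", "category", "value")

    // Expected to behave like the built-in file sources:
    // replace existing output and write one sub-directory per "year" value.
    df.write
      .format("excel")
      .mode(SaveMode.Overwrite)
      .partitionBy("year")
      .save("/tmp/excel-out") // hypothetical output path

    spark.stop()
  }
}
```

With the CSV source, the same call chain produces directories such as year=2021 and year=2022 under the output path; the proposal is for the excel format to do the same.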

Current Behavior

SaveMode and partitioning don't work as expected; see issue #539 and issue #547.

Possible Solution

Despite the comment in ExcelDataSource, I suggest changing the definition of ExcelDataSource from

class ExcelDataSource extends TableProvider with DataSourceRegister

to

class ExcelDataSource extends FileDataSourceV2

This allows us to remove most of the code copied from Spark. The result is very similar to Spark's own FileDataSourceV2 implementation for CSV.

In addition, we provide a simple implementation of fallbackFileFormat that supports only writing. So we have

class ExcelFileFormat extends FileFormat with DataSourceRegister {
...
  override def prepareWrite ...
...

prepareWrite() is essentially taken from ExcelWriteBuilder().

So as of now the main changes are:

src/main/3.x/scala/com/crealytics/spark/v2/excel/ExcelDataSource.scala
src/main/3.x/scala/com/crealytics/spark/v2/excel/ExcelFileFormat.scala

You can see all changes here: https://github.com/christianknoepfle/spark-excel/pull/1/files

There are some very basic unit tests for SaveMode and partitioning, and they work as expected.

I only tested it on Spark 3.0.3, so I guess some more work needs to be done, along with better testing and cleanup. The provided unit tests are mostly OK; I only get issues in what are probably unrelated suites: PlainNumberReadSuite and ErrorAsStringsReadSuite (I am using Windows for testing; not sure whether that explains it).

Please let me know if this makes sense to you and how we can move forward.

Thanks

Issue Analytics

Created 2 years ago
Comments:8 (5 by maintainers)

Top GitHub Comments

christianknoepfle commented, Apr 27, 2022

@nightscape, 3.0.1 and 3.0.3 worked fine on our local servers. I did only a little testing, but so far it looks OK. Since I am away next week, it will take a while until I can give you feedback (including whether it works with 3.0.1 on EMR). From what I understand from the versioning policy (https://spark.apache.org/versioning-policy.html), Spark does not guarantee binary compatibility. They try, but do not ensure it. I like your idea, but I am not sure it will work out in the end…

christianknoepfle commented, Mar 6, 2022

Hi, I did some cleanup on the code, checked the CI, and all checks passed. I removed 3.0.0 from the CI test matrix and replaced it with 3.0.1. The tests for writing a partitioned file structure are only executed on Spark >= 3.0.1; all other new tests worked fine with the 2.4.x tests (so the SaveMode.Overwrite issue was probably broken for Spark 3 only).

I opened a PR for the changes and hope that is OK. If something needs to change or needs further test coverage, please let me know. As soon as I can pull spark-excel with that change from the Maven repo, I will try it out in my “weekday professional environment” 😉
