PR Proposal: dataframe.write.format("excel") supports SaveMode.Overwrite/Append and partitioning
Expected Behavior
dataframe.write.format("excel") supports SaveMode.Overwrite/Append and partitioning in the same way as other data sources like csv.
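For illustration, the desired round trip looks something like this (df, the target path, and the partition columns are just placeholders):

```scala
import org.apache.spark.sql.SaveMode

// Overwrite an existing target directory and lay the files out by partition column,
// exactly as the csv/parquet sources do (year=.../month=... subdirectories).
df.write
  .format("excel")
  .mode(SaveMode.Overwrite)      // or SaveMode.Append
  .partitionBy("year", "month")
  .save("/tmp/excel-report")
```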
Current Behavior
SaveMode and partitioning don't work as expected; see issues #539 and #547.
Possible Solution
Despite the comment in ExcelDataSource, I suggest changing the definition of ExcelDataSource from
```scala
class ExcelDataSource extends TableProvider with DataSourceRegister
```
to
```scala
class ExcelDataSource extends FileDataSourceV2
```
This allows us to remove most of the code copied from Spark; the result is very similar to Spark's FileDataSourceV2 implementation for csv.
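As a rough sketch of what the entry point could look like, modeled on Spark's CSVDataSourceV2 (the getTable bodies are omitted because the FileDataSourceV2 helpers such as getPaths/getTableName differ slightly between Spark 3.x releases, so this is only illustrative, not the exact PR code):

```scala
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.execution.datasources.FileFormat
import org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class ExcelDataSource extends FileDataSourceV2 {

  // Write-only fallback format (see below); registering it lets Spark handle
  // SaveMode resolution and partitioned writes through its standard file-source paths.
  override def fallbackFileFormat: Class[_ <: FileFormat] = classOf[ExcelFileFormat]

  override def shortName(): String = "excel"

  // Implemented analogously to CSVDataSourceV2: resolve the paths from the options
  // and construct the existing ExcelTable with them (bodies elided in this sketch).
  override def getTable(options: CaseInsensitiveStringMap): Table = ???

  override def getTable(options: CaseInsensitiveStringMap, schema: StructType): Table = ???
}
```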
In addition, we provide a simple implementation for the fallbackFileFormat that only supports writing:
```scala
class ExcelFileFormat extends FileFormat with DataSourceRegister {
  ...
  override def prepareWrite ...
  ...
}
```
The prepareWrite() implementation is basically taken from ExcelWriteBuilder.
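For completeness, a minimal write-only sketch could look like the following. ExcelOptions and ExcelOutputWriter are the existing classes from the v2 package, but their constructor signatures and the file-extension handling are assumptions here, so treat this as illustrative rather than the exact PR code:

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.types.StructType

// Write-only fallback: reads still go through the V2 ExcelTable/ExcelScan path,
// so schema inference and buildReader are deliberately not implemented here.
class ExcelFileFormat extends FileFormat with DataSourceRegister {

  override def shortName(): String = "excel"

  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = None

  // Essentially the body of ExcelWriteBuilder.prepareWrite
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = {
    val excelOptions =
      new ExcelOptions(options, sparkSession.sessionState.conf.sessionLocalTimeZone)
    new OutputWriterFactory {
      // Simplified: the real code could derive the extension from the options
      override def getFileExtension(context: TaskAttemptContext): String = ".xlsx"
      override def newInstance(
          path: String,
          dataSchema: StructType,
          context: TaskAttemptContext): OutputWriter =
        new ExcelOutputWriter(path, dataSchema, context, excelOptions)
    }
  }
}
```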
So as of now the main changes are:
- src/main/3.x/scala/com/crealytics/spark/v2/excel/ExcelDataSource.scala
- src/main/3.x/scala/com/crealytics/spark/v2/excel/ExcelFileFormat.scala
You can see all changes here: https://github.com/christianknoepfle/spark-excel/pull/1/files
There are some very basic unit tests for SaveMode and partitioning, and they work as expected.
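For reference, the round trip these tests cover is roughly of this shape (a minimal sketch, assuming a test SparkSession named spark and a temporary directory targetDir; the header option follows the existing spark-excel options):

```scala
import org.apache.spark.sql.SaveMode

val df = spark.range(10).selectExpr("id", "id % 2 as part")

// Writing twice with Overwrite must not fail with "path already exists" and must
// replace the previous contents; partitionBy must create part=0/part=1 directories.
df.write.format("excel").option("header", "true")
  .mode(SaveMode.Overwrite).partitionBy("part").save(targetDir)
df.write.format("excel").option("header", "true")
  .mode(SaveMode.Overwrite).partitionBy("part").save(targetDir)

val readBack = spark.read.format("excel").option("header", "true").load(targetDir)
assert(readBack.count() == 10)
```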
I only tested it on Spark 3.0.3, so I guess some more work needs to be done, plus better testing and cleanup. The provided unit tests are mostly OK; I only get some failures in probably unrelated suites: PlainNumberReadSuite and ErrorAsStringsReadSuite (I am using Windows for testing, not sure if that explains it).
Please let me know if this makes sense to you and how we can move forward.
Thanks
Top GitHub Comments
@nightscape, 3.0.1 and 3.0.3 worked fine on our local servers. I only did a little testing, but so far it looks OK. Since I am away next week, it will take a bit until I can give you more feedback (and confirm whether it works with 3.0.1 on EMR). From what I understand from https://spark.apache.org/versioning-policy.html, Spark does not guarantee binary compatibility between versions; they try, but do not ensure it. I like your idea, but I am not sure it will work out in the end…
Hi, I did some cleanup on the code and checked the CI; all checks passed. I removed 3.0.0 from the testing CI and replaced it with 3.0.1. The tests for writing a partitioned file structure are only executed on Spark >= 3.0.1; all other new tests also pass against 2.4.x (so the whole SaveMode.Overwrite issue was probably broken for Spark 3 only).
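One simple way to gate a test like that on the Spark version (the actual PR may do this differently, e.g. via the CI build matrix) is ScalaTest's assume, for example:

```scala
// Naive lexicographic version check; sufficient for distinguishing 3.0.0 from 3.0.1+.
test("write partitioned excel files") {
  assume(org.apache.spark.SPARK_VERSION >= "3.0.1",
    "partitioned writes are only exercised on Spark >= 3.0.1")
  // ... write with partitionBy and verify the resulting directory layout ...
}
```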
I opened a PR for the changes and hope that is OK. If something needs to change or needs further test coverage, please let me know. As soon as I can pull spark-excel with that change from the Maven repo, I will try it out in my “weekday professional environment” 😉