Explore saving without intermediate representation
We have many reports about memory performance problems. One of the key problems I have seen is this:
- The user generates a workbook, which creates a workbook in our internal structures.
- The user calls XLWorkbook.Save. This method takes our internal structures and converts them in memory to DocumentFormat.OpenXml structures. That takes 1.5x-2x the memory of the ClosedXML structures (at least based on a few simplified demos).
We can (and will) improve the size/performance of our internal structures, but as long as the second part remains, it won’t help much. The DocumentFormat.OpenXml library is a key dependency, has been a cornerstone of ClosedXML since the start, and is obviously out of our control.
This issue is an exploration of whether we could just stream the sheet part data (and only the sheet part; the rest would still be saved by DocumentFormat.OpenXml) to the xlsx without converting it to DocumentFormat.OpenXml as an intermediate representation. E.g. LINQ to XML just generates a stream without significant memory overhead.
If I can save a valid workbook with rows and cells (no styling, nothing else, just some values), it might be worth exploring in more depth.
Obviously, this would have several drawbacks:
- no validation provided by DocumentFormat.OpenXml
- we would have to completely rework sheet part saving (not sure how many tags there are or how large it is), with a significant risk of regressions
- a massive amount of work that is now done by DocumentFormat.OpenXml
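To make the streaming idea concrete, here is a minimal sketch of writing a worksheet part row by row, without first building an in-memory tree. It uses only the Python standard library, not ClosedXML or DocumentFormat.OpenXml, so `write_sheet_part`, `column_name` and `demo_rows` are hypothetical names chosen for illustration. It writes values only (numbers and inline strings), with no styling and no validation, which is exactly the first drawback listed above.

```python
import io
from xml.sax.saxutils import XMLGenerator

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def column_name(index):
    """Convert a 0-based column index to Excel letters (0 -> A, 26 -> AA)."""
    name = ""
    index += 1
    while index:
        index, rem = divmod(index - 1, 26)
        name = chr(ord("A") + rem) + name
    return name


def write_sheet_part(out, rows):
    """Write worksheet XML to `out`, consuming `rows` one row at a time."""
    xml = XMLGenerator(out, encoding="utf-8")
    xml.startDocument()
    xml.startElement("worksheet", {"xmlns": MAIN_NS})
    xml.startElement("sheetData", {})
    for r, row in enumerate(rows, start=1):
        xml.startElement("row", {"r": str(r)})
        for c, value in enumerate(row):
            ref = f"{column_name(c)}{r}"
            if isinstance(value, (int, float)):
                # Numeric cell: the value goes straight into <v>.
                xml.startElement("c", {"r": ref})
                xml.startElement("v", {})
                xml.characters(str(value))
                xml.endElement("v")
            else:
                # Inline string, so no shared-strings part is needed.
                xml.startElement("c", {"r": ref, "t": "inlineStr"})
                xml.startElement("is", {})
                xml.startElement("t", {})
                xml.characters(str(value))
                xml.endElement("t")
                xml.endElement("is")
            xml.endElement("c")
        xml.endElement("row")
    xml.endElement("sheetData")
    xml.endElement("worksheet")
    xml.endDocument()


def demo_rows(count):
    # Generator: rows are produced lazily and never collected in a list.
    for i in range(count):
        yield [i, f"row {i}"]


buffer = io.BytesIO()
write_sheet_part(buffer, demo_rows(3))
sheet_xml = buffer.getvalue()
```

Because `rows` can be any iterable, peak memory depends on the size of one row rather than the whole sheet. In a real package the output stream would be the worksheet part's stream instead of a `BytesIO`.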
Resources:
- An example of saving through streaming - https://web.archive.org/web/20160216062257/http://blogs.msdn.com/b/brian_jones/archive/2010/06/22/writing-large-excel-files-with-the-open-xml-sdk.aspx
- FeedData API endpoint (seems too low level in comparison to streaming, but it exists) - https://stackoverflow.com/questions/11636258/openxml-replace-specific-customxml-part-of-word-document
- DocumentFormat.OpenXml issue with an example of how to replace an XML subtree in a part - https://github.com/OfficeDev/Open-XML-SDK/issues/566
Top GitHub Comments
@Pitterling Thanks. There has been great news from the Open XML SDK lately (that is the problem with the purple part). They finally added an abstraction for the package in https://github.com/OfficeDev/Open-XML-SDK/pull/1295. When I looked at that issue, it had been kicked around for something like 5 years in several repos (the BCL basically refused due to breaking changes). Plus, there might even be something that takes care of the memory issues for me (https://github.com/OfficeDev/Open-XML-SDK/issues/807#issuecomment-1377733580), though I will have to investigate more when I have time.
I have tried it, and saving through partial streaming (stream the row data, keep the rest in the DOM) seems doable (my super dirty exploration code at least saved something). The basic idea is two-phase saving.
There will probably be different ways to save individual parts, but for the worksheet part:
- [...] sheetData, but that is optional).
- Use the GenerateWorkbookPart logic to update the DOM to the required state, but don’t add the sheetData element to the DOM. Other elements (e.g. sheet properties, tables, and so on) will be loaded in the DOM.
- Create an OpenXmlReader from the worksheet DOM and an OpenXmlWriter that writes into a stream (basically this), but once we see a sheetData element in the reader stream (it is empty, because we skipped it in the previous step to save memory), use XLWorkbook to stream directly from CellCollection to the rows of sheetData.
This allows us to migrate pieces of DOM saving to SAX stream saving: basically, don’t read the data into the DOM, and when saving the DOM, detect a part that should be streamed and stream it. No memory benchmark so far.
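The reader-to-writer step could be sketched roughly as follows. This is a Python standard-library analogue of the idea, not the OpenXmlReader/OpenXmlWriter code itself: a SAX handler copies every event from a template to the output, and when it meets the (empty) sheetData element it streams the rows in at that point. `TEMPLATE`, `SheetDataSplicer` and the row generator are hypothetical stand-ins. The sketch assumes the template uses the default (unprefixed) namespace, that its sheetData is empty, and that all values are numeric.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

# Stand-in for the worksheet DOM saved without cell data: everything
# except the rows is present, and <sheetData> is empty.
TEMPLATE = (
    b'<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    b'<sheetViews><sheetView workbookViewId="0"/></sheetViews>'
    b'<sheetData/>'
    b'<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75"'
    b' header="0.3" footer="0.3"/>'
    b'</worksheet>'
)


class SheetDataSplicer(xml.sax.ContentHandler):
    """Copy SAX events to `out`, filling <sheetData> from `rows`."""

    def __init__(self, out, rows):
        super().__init__()
        self._xml = XMLGenerator(out, encoding="utf-8")
        self._rows = rows

    def startDocument(self):
        self._xml.startDocument()

    def endDocument(self):
        self._xml.endDocument()

    def startElement(self, name, attrs):
        self._xml.startElement(name, attrs)
        if name == "sheetData":
            # Phase two: rows come from the cell source, not the template.
            self._write_rows()

    def endElement(self, name):
        self._xml.endElement(name)

    def characters(self, content):
        self._xml.characters(content)

    def _write_rows(self):
        for r, row in enumerate(self._rows, start=1):
            self._xml.startElement("row", {"r": str(r)})
            for value in row:
                self._xml.startElement("c", {})
                self._xml.startElement("v", {})
                self._xml.characters(str(value))
                self._xml.endElement("v")
                self._xml.endElement("c")
            self._xml.endElement("row")


out = io.BytesIO()
rows = ((i, i * i) for i in range(1, 4))  # lazily generated numeric rows
xml.sax.parseString(TEMPLATE, SheetDataSplicer(out, rows))
result = out.getvalue()
```

The elements around sheetData pass through untouched, so only the row data bypasses the tree. If the template's sheetData were not empty, its children would be copied after the streamed rows, so a real implementation would have to skip them.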
There are of course some caveats, but no deal-breaker so far.