Explore saving without intermediate representation


We have many reports of memory performance problems. One of the key problems I have seen is this sequence:

  • The user generates a workbook, which creates a workbook in our internal structures.
  • The user calls XLWorkbook.Save. This method takes our internal structures and converts them in memory to DocumentFormat.OpenXml structures. That takes 1.5x-2x the memory of the ClosedXML structures (at least based on a few simplified demos).

We can (and will) improve the size/performance of our internal structures, but as long as the second part remains, it won’t help much. The DocumentFormat.OpenXml library is a key dependency, has been a cornerstone of ClosedXML since the start, and is obviously out of our control.

This issue is an exploration of whether we could stream just the sheet part data (and only the sheet part; the rest would still be saved by DocumentFormat.OpenXml) to the xlsx without converting it to DocumentFormat.OpenXml structures as an intermediate representation. For example, LINQ to XML can generate a stream without significant memory overhead.

If I can save a valid workbook with rows and cells (no styling no anything, just some values), it might be worth exploring more in depth.
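
To make the idea concrete, here is a minimal sketch of writing rows and cells straight to a stream, one row at a time, without building an in-memory tree first. It is written in Python with the standard library only, as a language-neutral stand-in: it does not use ClosedXML or DocumentFormat.OpenXml, the helper names `column_letter` and `stream_sheet` are hypothetical, and it produces only the worksheet XML, not a complete xlsx package.

```python
import io
from xml.sax.saxutils import XMLGenerator

# Main SpreadsheetML namespace used by worksheet parts.
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def column_letter(index):
    """Convert a 1-based column index to letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def stream_sheet(rows, out):
    """Write a bare worksheet (numeric values only) to `out` row by row.

    `rows` can be any iterable, including a generator, so only one row
    needs to be held in memory at a time.
    """
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("worksheet", {"xmlns": NS})
    gen.startElement("sheetData", {})
    for r, row in enumerate(rows, start=1):
        gen.startElement("row", {"r": str(r)})
        for c, value in enumerate(row, start=1):
            # A cell without a type attribute is read as a number.
            gen.startElement("c", {"r": f"{column_letter(c)}{r}"})
            gen.startElement("v", {})
            gen.characters(str(value))
            gen.endElement("v")
            gen.endElement("c")
        gen.endElement("row")
    gen.endElement("sheetData")
    gen.endElement("worksheet")
    gen.endDocument()


buf = io.StringIO()
stream_sheet(([i, i * 2] for i in range(1, 4)), buf)
```

The writer never holds more than the current row, which is the property the issue is after; styling, shared strings, and validation are all left out of this sketch.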

Obviously, this would have several drawbacks:

  • no validation provided by DocumentFormat.OpenXml
  • we would have to completely rework sheet part saving (not sure how many tags there are or how large it is), which carries a significant risk of regressions
  • we would take on a massive amount of work that is now done by DocumentFormat.OpenXml

Resources:

Issue Analytics

  • State:closed
  • Created a year ago
  • Reactions:2
  • Comments:5 (4 by maintainers)

Top GitHub Comments

1 reaction
jahav commented, Jan 16, 2023

@Pitterling Thanks. There has been great news from the OpenXML SDK lately (that is the problem with the purple part). They finally added an abstraction for packages in https://github.com/OfficeDev/Open-XML-SDK/pull/1295. When I looked at that issue, it had been kicked around for something like 5 years in several repos (the BCL basically refused it due to breaking changes). Plus, there might even be something that takes care of the memory issues for me: https://github.com/OfficeDev/Open-XML-SDK/issues/807#issuecomment-1377733580 , though I will have to investigate more when I have time.

1 reaction
jahav commented, Oct 25, 2022

I have tried it, and saving through partial streaming (stream the row data, keep the rest in the DOM) seems doable (my super dirty exploration code at least saved something). The basic idea is two-phase saving:

  • First, determine all parts that must be added/removed and assign them RelIds (a RelId identifies the relationship between parts, so RelIds must be assigned before we write the parts themselves). Currently, the RelId is often assigned in the middle of part writing.
  • Then write the individual parts. There should be a separate writer for each part, kept separate from XLWorkbook; I'm not a fan of a 5k+ line super file.
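
The two phases can be sketched roughly as follows. This is an illustrative Python outline rather than ClosedXML code: the function names (`plan_parts`, `write_workbook_xml`, `write_rels_xml`, `save`) are hypothetical, XML namespaces are omitted for brevity, and the result is a dictionary of part contents, not a valid xlsx package.

```python
from xml.sax.saxutils import quoteattr


def plan_parts(sheet_names):
    """Phase 1: decide which parts exist and give each a relationship id."""
    return [
        {"name": name, "rel_id": f"rId{i}", "path": f"worksheets/sheet{i}.xml"}
        for i, name in enumerate(sheet_names, start=1)
    ]


def write_workbook_xml(plan):
    """Phase 2 writer for the workbook part; it only reads ids from the plan."""
    sheets = "".join(
        f'<sheet name={quoteattr(p["name"])} sheetId="{i}" r:id="{p["rel_id"]}"/>'
        for i, p in enumerate(plan, start=1)
    )
    return f"<workbook><sheets>{sheets}</sheets></workbook>"


def write_rels_xml(plan):
    """Phase 2 writer for the relationships part, using the same plan."""
    rels = "".join(
        f'<Relationship Id="{p["rel_id"]}" Target="{p["path"]}"/>' for p in plan
    )
    return f"<Relationships>{rels}</Relationships>"


def save(sheet_names):
    """Run phase 1 once, then let each independent writer produce its part."""
    plan = plan_parts(sheet_names)
    parts = {
        "xl/workbook.xml": write_workbook_xml(plan),
        "xl/_rels/workbook.xml.rels": write_rels_xml(plan),
    }
    for p in plan:
        # Placeholder body; a real writer would stream the sheet content here.
        parts["xl/" + p["path"]] = "<worksheet><sheetData/></worksheet>"
    return parts


parts = save(["Data", "Summary"])
```

Because every id is fixed before any writer runs, no writer has to mutate shared state mid-write, which is what makes one small writer per part feasible.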

There will probably be different ways to save individual parts, but for worksheet part:

  • Load the DOM worksheet object from the workbook, or create a new one (if possible, skip sheetData, but that is optional).
  • Use the GenerateWorkbookPart logic to update the DOM to the required state, but don’t add the sheetData element's content to the DOM; other elements (e.g. sheet properties, tables, and so on) will still be loaded in the DOM.
  • Create an OpenXmlReader from the worksheet DOM and an OpenXmlWriter that writes into a stream (basically this), but once we see a sheetData element in the reader stream (it is empty, because we skipped it in the previous step to save memory), use XLWorkbook to stream directly from CellCollection to the rows of sheetData.
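
The reader-to-writer pass in the last step can be illustrated with a small Python sketch built on the standard `xml.sax` module. It stands in for the OpenXmlReader/OpenXmlWriter pair and does not use either API: the class `SheetDataInjector` and the function `inject_rows` are hypothetical names, the skeleton XML omits namespaces, and cells are written without references or types.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class SheetDataInjector(XMLGenerator):
    """Copy every SAX event to the output; fill sheetData from `rows`."""

    def __init__(self, out, rows):
        super().__init__(out, encoding="utf-8")
        self._rows = rows

    def startElement(self, name, attrs):
        super().startElement(name, attrs)
        if name != "sheetData":
            return
        # The skeleton's sheetData is empty, so stream the rows here,
        # straight from the row source, before its end tag is copied.
        for r, row in enumerate(self._rows, start=1):
            super().startElement("row", {"r": str(r)})
            for value in row:
                super().startElement("c", {})
                super().startElement("v", {})
                super().characters(str(value))
                super().endElement("v")
                super().endElement("c")
            super().endElement("row")


def inject_rows(skeleton_xml, rows):
    """Re-emit the skeleton worksheet with rows streamed into sheetData."""
    out = io.StringIO()
    xml.sax.parseString(skeleton_xml, SheetDataInjector(out, rows))
    return out.getvalue()


SKELETON = b'<worksheet><sheetPr/><sheetData/><pageMargins left="0.7"/></worksheet>'
result = inject_rows(SKELETON, ([i, i + 1] for i in range(1, 3)))
```

Everything other than sheetData passes through unchanged and in order, so the small elements can stay in the DOM while only the large row data bypasses it.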

This allows us to migrate pieces of DOM saving into SAX stream saving: basically, don’t read the streamed part into the DOM, and when saving the DOM, detect a part that should be streamed and stream it. No memory benchmark so far.

There are of course some caveats, but no deal-breakers so far.
