Modification to enable producing consistent binary output
See original GitHub issueI think it’d be very useful to have a way to make xlsxwriter
produce identical binary result every time a workbook with identical contents is generated.
Two scenarios I have personally in mind:
- it would be useful to keep some slow-changing data in worksheets in version control without wasting space every time a report is regenerated,
- it would be helpful to quickly check binary hash of a report to verify the same contents were generated as a kind of unit test
Unfortunately, that’s currently not possible. I’ve seen that there are two moving pieces here:
-
one can set constant creation/modification date that is written to the metadata using “document properties”, but
-
the zipping process always sets file timestamps to current time what results in the archive being different on every run.
I’ve tracked down the problem to the zipping library which internally makes use of timestamps of temporary files created by xlsxwriter
or current time when in-memory mode is used.
In my pull request I propose a solution where creation time is taken from the metadata (if available) and:
-
a
ZipInfo
structure with that date is used for in-memory zipping -
temporary files’ modification time is set to that date with file-based zipping.
All of that can be done in Workbook
’s _store_workbook
method. I wondered whether setting the temporary files’ date in Packager
would be more natural but decided it’s better to have it all in one place for both file-based and in-memory modes.
I think the enhancement would be useful not only to me (if properly advertised) but I tried to keep a low profile and “constant output mode” is only activated when creation date is set in properties which I expect would be rare.
For implementation details and usage code see my pull request https://github.com/jmcnamara/XlsxWriter/pull/495.
All existing unit tests passed (linux, py27). I’ve seen your encouragement for creating new ones for any additional functionality but all existing tests check internal xml data I haven’t spotted a natural place to add an “outside” binary-level test.
Last but not least, thanks for your great library, I use it to all kind of things and am glad I can give something back.
Issue Analytics
- State:
- Created 5 years ago
- Comments:6 (6 by maintainers)
Top GitHub Comments
Thanks, you’re right, it was stupid of me not to check Excel. What’s more setting a constant is even simpler and cleaner, I’ve updated the pull request accordingly.
As I checked in the meantime, LibreOffice and Google Docs are not Excel-compliant 😄 they have current time embedded as zipped files’ timestamp which makes the files different on every save.
I’ve been worried as I realized there could be a side effect of changing timestamps on some non-temp files if such were used in the zipping but doesn’t seem to happen. I suspected embedding external image could reuse the original file for example but it looks it doesn’t.
Now your stackoverflow answer you mentioned would become true, setting the “created” property alone would make both files identical without depending on a race condition (that these two files are created in such rapid succession that tempfiles get the same filesystem timestamp) 😄
Happy Easter!
Indeed! 😃