Add validator (Nonconformance to Office Open XML schema)
EDIT: The actionable thing to do here is to add a JavaScript validator against one of the wml.xsd schemas.
==========
Hey there! Great library, I’ve been using it a while and trying to help out a little where I can.
An issue I’ve come across is that it’s really easy to generate a corrupted document, and tricky to pinpoint exactly where and why this happens. It’s not the fault of this library, and to be honest there aren’t any good options for validating these XML documents in javascript (I’m working on this – soon I hope to have a validator with specific error messages in pure js!).
I’ve set up a hacky tool to validate these documents locally on linux with libxml/xmllint – I’ll share that setup once I write a little wrapper around it – and I’ve noticed that it spits out a ton of errors. One of the most common errors is that, in several places in the spec, there’s a specific sequence of nodes expected in order to conform. Nodes being out of order mostly works, but I suspect it’s caused some bugs!
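The xmllint setup mentioned above could be wrapped along these lines in Node.js. This is only a rough sketch, not the actual tool: the function names are made up for illustration, it assumes xmllint is on the PATH and that word/document.xml has already been unzipped from the .docx, and it assumes xmllint marks violations with the text "Schemas validity error" on stderr.

```javascript
const { execFile } = require("node:child_process");

// Arguments for xmllint: --noout suppresses the echoed document,
// --schema validates the input against the given XSD file.
function xmllintArgs(schemaPath, xmlPath) {
  return ["--noout", "--schema", schemaPath, xmlPath];
}

// Keep only the lines that report schema violations.
// Assumption: xmllint marks them with the text "Schemas validity error".
function schemaErrors(stderrText) {
  return stderrText
    .split("\n")
    .filter((line) => line.includes("Schemas validity error"));
}

// Run xmllint and resolve with the schema errors it printed.
// Rejects only when xmllint itself could not be run to completion.
function validateXml(schemaPath, xmlPath) {
  return new Promise((resolve, reject) => {
    execFile("xmllint", xmllintArgs(schemaPath, xmlPath), (error, _stdout, stderr) => {
      // A numeric error.code is just a non-zero exit status (validation failed);
      // anything else (e.g. "ENOENT") means xmllint could not be started.
      if (error && typeof error.code !== "number") {
        reject(error);
        return;
      }
      resolve(schemaErrors(stderr));
    });
  });
}
```

Usage would look something like `validateXml("wml.xsd", "word/document.xml").then((errors) => console.log(errors.join("\n")))`, where both paths are placeholders for wherever the schema and the extracted document live.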
See for example, the schema for the base abstract type used under w:pPr
- notice the <xsd:sequence>
- this means they must be in this specific order to conform.
<xsd:complexType name="CT_PPrBase">
  <xsd:sequence>
    <xsd:element name="pStyle" type="CT_String" minOccurs="0"/>
    <xsd:element name="keepNext" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="keepLines" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="pageBreakBefore" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="framePr" type="CT_FramePr" minOccurs="0"/>
    <xsd:element name="widowControl" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="numPr" type="CT_NumPr" minOccurs="0"/>
    <xsd:element name="suppressLineNumbers" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="pBdr" type="CT_PBdr" minOccurs="0"/>
    <xsd:element name="shd" type="CT_Shd" minOccurs="0"/>
    <xsd:element name="tabs" type="CT_Tabs" minOccurs="0"/>
    <xsd:element name="suppressAutoHyphens" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="kinsoku" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="wordWrap" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="overflowPunct" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="topLinePunct" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="autoSpaceDE" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="autoSpaceDN" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="bidi" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="adjustRightInd" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="snapToGrid" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="spacing" type="CT_Spacing" minOccurs="0"/>
    <xsd:element name="ind" type="CT_Ind" minOccurs="0"/>
    <xsd:element name="contextualSpacing" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="mirrorIndents" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="suppressOverlap" type="CT_OnOff" minOccurs="0"/>
    <xsd:element name="jc" type="CT_Jc" minOccurs="0"/>
    <xsd:element name="textDirection" type="CT_TextDirection" minOccurs="0"/>
    <xsd:element name="textAlignment" type="CT_TextAlignment" minOccurs="0"/>
    <xsd:element name="textboxTightWrap" type="CT_TextboxTightWrap" minOccurs="0"/>
    <xsd:element name="outlineLvl" type="CT_DecimalNumber" minOccurs="0"/>
    <xsd:element name="divId" type="CT_DecimalNumber" minOccurs="0"/>
    <xsd:element name="cnfStyle" type="CT_Cnf" minOccurs="0" maxOccurs="1"/>
  </xsd:sequence>
</xsd:complexType>
Sadly, this is not documented anywhere on the officeopenxml.com site; it is only found in the ECMA-376 reference schemas (see, for example, ECMA-376 fifth edition, part one, page 3839, which contains a version of the above element type).
https://www.ecma-international.org/publications-and-standards/standards/ecma-376/
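To make the ordering rule concrete, here is a small sketch in plain JavaScript that checks and repairs the child order of a w:pPr using the CT_PPrBase sequence quoted above. It is not code from the docx library: PPR_BASE_ORDER, findOutOfOrder and sortIntoSchemaOrder are names invented for this example, and it only looks at element order, not at types, attributes or duplicates.

```javascript
// Child element order for w:pPr, copied from the CT_PPrBase sequence above.
const PPR_BASE_ORDER = [
  "pStyle", "keepNext", "keepLines", "pageBreakBefore", "framePr",
  "widowControl", "numPr", "suppressLineNumbers", "pBdr", "shd", "tabs",
  "suppressAutoHyphens", "kinsoku", "wordWrap", "overflowPunct",
  "topLinePunct", "autoSpaceDE", "autoSpaceDN", "bidi", "adjustRightInd",
  "snapToGrid", "spacing", "ind", "contextualSpacing", "mirrorIndents",
  "suppressOverlap", "jc", "textDirection", "textAlignment",
  "textboxTightWrap", "outlineLvl", "divId", "cnfStyle",
];

// Map each element name to its position in the sequence.
const PPR_RANK = new Map(PPR_BASE_ORDER.map((name, index) => [name, index]));

// Strip an optional namespace prefix such as "w:".
function localName(qualifiedName) {
  const colon = qualifiedName.indexOf(":");
  return colon === -1 ? qualifiedName : qualifiedName.slice(colon + 1);
}

// Return every name that appears after an element the schema places later.
function findOutOfOrder(childNames) {
  const violations = [];
  let highestRank = -1;
  for (const name of childNames) {
    const rank = PPR_RANK.get(localName(name));
    if (rank === undefined) continue; // not listed in CT_PPrBase; ignored here
    if (rank < highestRank) violations.push(name);
    else highestRank = rank;
  }
  return violations;
}

// Return a sorted copy; names not in the list keep their relative order at the end.
function sortIntoSchemaOrder(childNames) {
  const rankOf = (name) => PPR_RANK.get(localName(name)) ?? PPR_BASE_ORDER.length;
  return [...childNames].sort((a, b) => rankOf(a) - rankOf(b));
}

console.log(findOutOfOrder(["w:jc", "w:spacing", "w:ind"]));      // [ 'w:spacing', 'w:ind' ]
console.log(sortIntoSchemaOrder(["w:jc", "w:spacing", "w:ind"])); // [ 'w:spacing', 'w:ind', 'w:jc' ]
```

A generator could run something like sortIntoSchemaOrder just before serializing each w:pPr, while findOutOfOrder is closer to what a validator would report. The sort relies on Array.prototype.sort being stable, which holds in current Node.js versions.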
Issue Analytics
- Created 2 years ago
- Comments:8 (8 by maintainers)
Top GitHub Comments
I made quite a bit of progress on this, here: https://github.com/dolanmiu/docx/compare/master...devoidfury:bug/ooxml-conformance-fixes
The main errors I’m getting that I don’t know how to handle:
- Invalid mirrorMargins attribute on w:pgMar. EDIT: this has been removed on my branch.
- Invalid element w:shdCs (couldn’t find a reference for these anywhere – should it just be deleted? Looks like w:shd does everything here). EDIT: this has been removed in my branch.
- w:document has an invalid attribute mc:Ignorable="w14 w15 wp14", couldn’t find a reference or documentation for this property anywhere. This is written about here, and it’s a commonly used attribute among various XML document types: http://www.wordarticles.com/Articles/Formats/OOXML/OOXML.php
@devoidfury I am adding it into GitHub Actions
Thank you for your research into this area
The checks are based on the same OOXML schemas as your docx-validator project:
https://github.com/dolanmiu/docx/pull/1202