Allow access to Word field codes via instrText
I’d like to use mammoth.js to extract bibliographic metadata from references put in with the Zotero reference manager. These references are usually encoded as fields, and their field code looks like:
<w:r>
<w:instrText xml:space="preserve">
ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"7kotdh7ut","properties":{"formattedCitation":"(Beese, Negishi, & Levin, 2009)","plainCitation":"(Beese, Negishi, & Levin,
2009)"},"citationItems":[{"id":78,"uris":["http://zotero.org/users/1031436/items/4WATGD54"],"uri":["http://zotero.org/users/1031436/items/4WATGD54"],"itemData":{"id":78,"type":"article-journal","title":"Identification of Positive Regulators of the
Yeast Fps1 Glycerol Channel","container-title":"PLoS Genet","page":"e1000738","volume":"5","issue":"11","source":"PLoS Genet","abstract":"Author Summary\nWhen challenged by changes in extracellular osmolarity, many fungal species regulate their
intracellular glycerol concentration to modulate their internal osmotic pressure. Maintenance of osmotic homeostasis prevents either cellular collapse under hyper-osmotic stress or cell rupture under hypo-osmotic stress. In baker's yeast, the Fps1
glycerol channel functions as the main vent for glycerol. Proper regulation of Fps1 is critical to the maintenance of osmotic homeostasis. In this study, we identify a pair of proteins (Rgc1 and Rgc2) that function as positive regulators of Fps1
activity. Their absence results in hyper-accumulation of glycerol and consequent cell lysis due to impaired Fps1 channel activity. Additionally, we found that these glycerol channel regulators function between the Hog1 (High Osmolarity Glycerol
response) signaling kinase and Fps1, defining a signaling pathway for control of glycerol efflux. Because members of the Rgc1/2 family are found among pathogenic fungal species, but not in humans, they represent potentially attractive targets for
antifungal drug development.","URL":"http://dx.doi.org/10.1371/journal.pgen.1000738","DOI":"10.1371/journal.pgen.1000738","journalAbbreviation":"PLoS Genet","author":[{"family":"Beese","given":"Sara
E."},{"family":"Negishi","given":"Takahiro"},{"family":"Levin","given":"David
E."}],"issued":{"date-parts":[["2009",11,26]]},"accessed":{"date-parts":[["2011",10,21]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"}
</w:instrText>
</w:r>
Example file: zotero-cit.docx
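For illustration, once the text of a field code like the one above is available as a string, the bibliographic metadata can be recovered by stripping the "ADDIN ZOTERO_ITEM CSL_CITATION" prefix and parsing the rest as JSON. The following is a minimal sketch, not part of mammoth.js: the helper names parseZoteroFieldCode and listCitedItems are hypothetical, and the sample string is an abbreviated version of the field code shown above.

```javascript
// Hypothetical helper (not a mammoth.js API): parse the JSON payload of a
// Zotero citation field code. Returns null for any other kind of field.
function parseZoteroFieldCode(instruction) {
  const marker = "ADDIN ZOTERO_ITEM CSL_CITATION";
  const text = instruction.trim();
  if (!text.startsWith(marker)) {
    return null; // some other field, e.g. PAGE or TOC
  }
  try {
    // Everything after the marker is expected to be a single JSON object.
    return JSON.parse(text.slice(marker.length));
  } catch (err) {
    return null; // truncated or malformed payload
  }
}

// Hypothetical helper: reduce a parsed citation to the title and DOI of each cited item.
function listCitedItems(citation) {
  return (citation.citationItems || []).map((item) => ({
    title: item.itemData ? item.itemData.title : undefined,
    doi: item.itemData ? item.itemData.DOI : undefined,
  }));
}

// Abbreviated sample based on the field code above.
const sampleFieldCode =
  ' ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"7kotdh7ut",' +
  '"citationItems":[{"id":78,"itemData":{"type":"article-journal",' +
  '"DOI":"10.1371/journal.pgen.1000738"}}]} ';

const sampleCitation = parseZoteroFieldCode(sampleFieldCode);
console.log(listCitedItems(sampleCitation));
```

Returning null instead of throwing keeps the helper usable as a filter when a document mixes Zotero fields with other field types.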
Is there any way to target instrText? When I try creating a style map for it, I get the following message:
Did not understand this style mapping, so ignored it: instrText => div.csl
Error was at character number 1: Expected element type but got identifier “instrText”
If I could monkey patch this in a personal copy that would be fine too, but I couldn’t find any places in the code where instrText is explicitly ignored, and I couldn’t figure out what to change.
(followup of https://github.com/mwilliamson/mammoth.js/issues/8#issuecomment-250647081)
I know I’m not really using mammoth.js for its intended purpose, but for the curious:
I solved this for myself (https://github.com/rmzelle/ref-extractor/) by extending the variable “xmlElementReaders” to recognize instrText elements, and exporting the contents of each field to a global array variable: https://github.com/rmzelle/mammoth.js/commit/77bcac57a2f4f5095f7e8ae71419863dbdb3bc26#diff-68b60b2443cf3be90e7e7223aaf3d383
It requires me to use a customized version of mammoth.js, but it seemed the easiest way to allow me to:
a) access the content of instrText elements
b) post-process the content of each instrText element
c) ignore any other content in the Word document
Okey doke, closing the issue.
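The three goals listed in the comment above (access instrText content, post-process it, ignore everything else) can also be approximated without a patched mammoth.js by scanning the raw XML directly. This is a rough sketch under stated assumptions, not the approach taken in the linked commit: the function names are hypothetical, the input is assumed to be the text of word/document.xml already read out of the .docx archive, and each field code is assumed to sit in a single w:instrText element, which may not hold for every document. A real XML parser would be more robust than the regular expression used here.

```javascript
// Hypothetical helpers; none of these names are mammoth.js APIs.

// Decode the five predefined XML entities (&amp; last, to avoid double-decoding).
function decodeXmlEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&");
}

// (a) and (c): return the text of every <w:instrText> element and nothing else.
// The lookbehind skips self-closing <w:instrText/> elements.
function extractInstrTexts(documentXml) {
  const pattern = /<w:instrText\b[^>]*(?<!\/)>([\s\S]*?)<\/w:instrText>/g;
  return Array.from(documentXml.matchAll(pattern), (match) =>
    decodeXmlEntities(match[1])
  );
}

// (b): run a caller-supplied post-processing step on each field code and
// drop the ones for which it returns null or undefined.
function collectFieldCodes(documentXml, postProcess = (code) => code) {
  return extractInstrTexts(documentXml)
    .map((code) => postProcess(code.trim()))
    .filter((value) => value !== null && value !== undefined);
}

// Usage: keep only Zotero citation fields from a small made-up fragment.
const sampleXml =
  "<w:p><w:r><w:t>Body text</w:t></w:r>" +
  '<w:r><w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"abc"} </w:instrText></w:r>' +
  "<w:r><w:instrText> PAGE </w:instrText></w:r></w:p>";

const zoteroCodes = collectFieldCodes(sampleXml, (code) =>
  code.startsWith("ADDIN ZOTERO_ITEM") ? code : null
);
console.log(zoteroCodes.length);
```

Passing the post-processing step as a callback keeps the extraction generic, so the same scan can serve other field types besides Zotero citations.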