Silent data loss when response contains characters invalid for XML PCDATA
Hi @MartijnR, some people are copying text from PDFs that contains old ASCII control characters and pasting it as responses to questions. I’m guessing that, depending on their platform, these characters might be invisible to them.
The problem is that these characters break submissions and draft saving/loading, but there’s no indication of an error until it’s too late to retrieve the responses already entered.
Form used to test (single text question)
<?xml version="1.0" encoding="utf-8"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<h:head>
<h:title>text</h:title>
<model>
<instance>
<text id="text">
<formhub>
<uuid/>
</formhub>
<text/>
<meta>
<instanceID/>
</meta>
</text>
</instance>
<bind nodeset="/text/text" type="string"/>
<bind calculate="concat('uuid:', uuid())" nodeset="/text/meta/instanceID" readonly="true()" type="string"/>
<bind calculate="'dbfdf102b9e74a69abdcc8527a702275'" nodeset="/text/formhub/uuid" type="string"/>
</model>
</h:head>
<h:body>
<input ref="/text/text">
<label>text</label>
</input>
</h:body>
</h:html>
Reproduce via drafts
- Open the form;
- Paste something funky as the text response, e.g. whatthehey;
  - ℹ️ It looks like GitHub is not preserving the characters. Perhaps generate the text response with console.log('what' + String.fromCharCode(4) + 'the' + String.fromCharCode(28) + 'hey');
- Save the submission as a draft;
- Notice no errors in the UI or developer console;
- Attempt to load the draft;
- See “Loading Error…Error trying to parse XML record. Invalid XML source”.
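The failure above comes down to code points that the XML 1.0 `Char` production does not allow. As a minimal sketch (the helper names `isValidXmlChar` and `findInvalidXmlChars` are hypothetical and not part of Enketo), something like the following could flag such characters in a response before it is written into the record:

```javascript
// Sketch only: checks code points against the XML 1.0 "Char" production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isValidXmlChar(codePoint) {
  return (
    codePoint === 0x9 ||
    codePoint === 0xa ||
    codePoint === 0xd ||
    (codePoint >= 0x20 && codePoint <= 0xd7ff) ||
    (codePoint >= 0xe000 && codePoint <= 0xfffd) ||
    (codePoint >= 0x10000 && codePoint <= 0x10ffff)
  );
}

// Returns the UTF-16 index and code point of every character that
// an XML 1.0 parser would reject inside PCDATA.
function findInvalidXmlChars(text) {
  const found = [];
  let index = 0;
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const codePoint = ch.codePointAt(0);
    if (!isValidXmlChar(codePoint)) {
      found.push({ index, codePoint });
    }
    index += ch.length;
  }
  return found;
}

// The same test string as in the reproduction steps.
const funky = 'what' + String.fromCharCode(4) + 'the' + String.fromCharCode(28) + 'hey';
console.log(findInvalidXmlChars(funky));
```

For the test string this reports code points 4 and 28, which matches the "PCDATA invalid Char value 4" message Chrome produces for the first of them.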
Reproduce via submission
- Use Google Chrome†;
- Create funky text response as above;
- Submit;
- See parseXML error in the console;
- See “text - 1 was successfully submitted” in the UI;
- Notice (using the browser’s developer tools) that the POSTed form data contains a valid root node and form UUID:
------WebKitFormBoundaryBGt3di99CAJQp9hF
Content-Disposition: form-data; name="xml_submission_file"; filename="xml_submission_file"
Content-Type: text/xml

<text xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" id="text"><parsererror xmlns="http://www.w3.org/1999/xhtml" style="display: block; white-space: pre; border: 2px solid #c77; padding: 0 1em 0 1em; margin: 1em; background-color: #fdd; color: black"><h3>This page contains the following errors:</h3><div style="font-family:monospace;font-size:12px">error on line 5 at column 21: PCDATA invalid Char value 4</div><h3>Below is a rendering of the page up to the first error.</h3></parsererror>
<formhub>
<uuid>dbfdf102b9e74a69abdcc8527a702275</uuid>
</formhub>
<text/></text>
------WebKitFormBoundaryBGt3di99CAJQp9hF--
- Notice, at least in the case of KoBoCAT, that this incomplete XML is indeed accepted as a valid submission, and a 201 response is returned to Enketo;
- Observe that Enketo has deleted the submission from browser storage, as is typical after a successful upload.
† Firefox renders the XML parsererror differently, in a way that doesn’t resemble valid submission XML and isn’t accepted by KoBoCAT.
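Since both browsers report the failure as a `parsererror` element inside the parsed document rather than by throwing, one possible guard is to look for that element before a record is saved or uploaded. This is a sketch under that assumption; `hasParserError` and `assertNoParserError` are hypothetical names, not Enketo functions:

```javascript
// Sketch only: `doc` is expected to be a Document, e.g. the result of
//   new DOMParser().parseFromString(xmlString, 'text/xml')
// in a browser. Both the Chrome and Firefox renderings described above
// contain an element named "parsererror" somewhere in the document.
function hasParserError(doc) {
  return doc.getElementsByTagName('parsererror').length > 0;
}

// Throws instead of letting a broken record be serialized and uploaded,
// so the caller can keep the original data in browser storage.
function assertNoParserError(doc) {
  if (hasParserError(doc)) {
    throw new Error('Record could not be parsed as XML; refusing to submit.');
  }
  return doc;
}
```

Failing loudly at this point would let the UI show an error while the entered responses are still retrievable, instead of reporting a successful submission.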
That’s a good point. Would be great if the browser stripped the control characters out as well when pasting. I guess we’ve documented now why we have to strip them in EE? 🤷
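For illustration, stripping could look roughly like the sketch below. It is not the EE implementation (which is not shown in this thread); `stripInvalidXmlChars` is a hypothetical name, and the character class simply negates the XML 1.0 `Char` production:

```javascript
// Sketch only: matches any code point that is NOT allowed by XML 1.0
// (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]).
// The "u" flag makes the class operate on code points, so valid
// surrogate pairs are kept and lone surrogates are removed.
const INVALID_XML_CHARS = /[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

function stripInvalidXmlChars(text) {
  return text.replace(INVALID_XML_CHARS, '');
}

const funky = 'what' + String.fromCharCode(4) + 'the' + String.fromCharCode(28) + 'hey';
console.log(stripInvalidXmlChars(funky)); // prints "whatthehey"
```

Silently dropping characters changes what the user entered, so a real fix might prefer to warn the user as well; that choice is outside what this sketch covers.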
Update: I tested it with a source PDF and was able to reproduce the issue without needing to generate invalid chars in the console. I copied the text into this file: invalid characters.txt
The issue stems from the fact that ligature letters (like the combined ff or fi characters) are copied as these ASCII control characters. Here is the same abstract as in the text file with the ligatures highlighted: