Silent data loss when response contains characters invalid for XML PCDATA

Hi @MartijnR, some people are copying text from PDFs that contains old ASCII control characters and pasting it as responses to questions. I’m guessing that, depending on their platform, these characters might be invisible to them.

The problem is that these characters break submissions and draft saving/loading, but there is no indication of an error until it is too late to retrieve the responses already entered.

Form used to test (single text question)

<?xml version="1.0" encoding="utf-8"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <h:head>
    <h:title>text</h:title>
    <model>
      <instance>
        <text id="text">
          <formhub>
            <uuid/>
          </formhub>
          <text/>
          <meta>
            <instanceID/>
          </meta>
        </text>
      </instance>
      <bind nodeset="/text/text" type="string"/>
      <bind calculate="concat('uuid:', uuid())" nodeset="/text/meta/instanceID" readonly="true()" type="string"/>
      <bind calculate="'dbfdf102b9e74a69abdcc8527a702275'" nodeset="/text/formhub/uuid" type="string"/>
    </model>
  </h:head>
  <h:body>
    <input ref="/text/text">
      <label>text</label>
    </input>
  </h:body>
</h:html>

Reproduce via drafts

  1. Open the form;
  2. Paste something funky as the text response, e.g. whatthehey;
    • ℹ️ It looks like GitHub is not preserving the characters. Perhaps generate the text response with console.log('what' + String.fromCharCode(4) + 'the' + String.fromCharCode(28) + 'hey');
  3. Save the submission as a draft;
  4. Notice no errors in the UI or developer console;
  5. Attempt to load the draft;
  6. See “Loading Error…Error trying to parse XML record. Invalid XML source”.
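The invisible characters in step 2 can be made visible with a small check. This is a minimal sketch, not Enketo code: `findInvalidXmlChars` is a hypothetical helper, and the regular expression is the complement of the XML 1.0 `Char` production (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF).

```javascript
// Anything outside the XML 1.0 "Char" production cannot appear in PCDATA.
const INVALID_XML_CHAR = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/u;

// Hypothetical helper (not part of Enketo): lists the code points in
// `value` that an XML 1.0 parser would reject.
function findInvalidXmlChars(value) {
  const found = [];
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of value) {
    if (INVALID_XML_CHAR.test(ch)) {
      found.push(ch.codePointAt(0));
    }
  }
  return found;
}

// The test response from step 2, generated as suggested in the note.
const response = 'what' + String.fromCharCode(4) + 'the' + String.fromCharCode(28) + 'hey';
console.log(findInvalidXmlChars(response)); // [ 4, 28 ]
```

The value 4 matches the "PCDATA invalid Char value 4" message that Chrome reports in the submission case.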

Reproduce via submission

  1. Use Google Chrome†;
  2. Create funky text response as above;
  3. Submit;
  4. See parseXML error in the console;
  5. See “text - 1 was successfully submitted” in the UI;
  6. Notice (using the browser’s developer tools) that the POSTed form data contains a valid root node and form UUID:
    ------WebKitFormBoundaryBGt3di99CAJQp9hF
    Content-Disposition: form-data; name="xml_submission_file"; filename="xml_submission_file"
    Content-Type: text/xml
    
    <text xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" id="text"><parsererror xmlns="http://www.w3.org/1999/xhtml" style="display: block; white-space: pre; border: 2px solid #c77; padding: 0 1em 0 1em; margin: 1em; background-color: #fdd; color: black"><h3>This page contains the following errors:</h3><div style="font-family:monospace;font-size:12px">error on line 5 at column 21: PCDATA invalid Char value 4</div><h3>Below is a rendering of the page up to the first error.</h3></parsererror>
              <formhub>
                <uuid>dbfdf102b9e74a69abdcc8527a702275</uuid>
              </formhub>
              <text/></text>
    ------WebKitFormBoundaryBGt3di99CAJQp9hF--
    
  7. Notice, at least in the case of KoBoCAT, that this incomplete XML is indeed accepted as a valid submission, and a 201 response is returned to Enketo;
  8. Observe that Enketo has deleted the submission from browser storage, as is typical after a successful upload.

† Firefox renders the XML parsererror differently, in a way that doesn’t resemble valid submission XML and isn’t accepted by KoBoCAT.
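The payload in step 6 suggests one possible safeguard: refuse to upload (or accept) a record that contains a `parsererror` element. In browsers, `DOMParser.parseFromString` reports XML errors by returning a document that contains such an element instead of throwing, which is consistent with what Chrome serialized above. The sketch below is an assumption about how a guard could look, not how Enketo or KoBoCAT behave. `hasParserError` and `assertSubmittable` are hypothetical names, and the string check would give a false positive for a form that legitimately has a node called `parsererror`.

```javascript
// Hypothetical guard: true when the serialized record contains a
// <parsererror> element, as in the Chrome payload above.
function hasParserError(xml) {
  return /<parsererror[\s>\/]/.test(xml);
}

// Hypothetical wrapper: throws instead of letting a broken record be
// uploaded, so the caller can keep the record in browser storage.
function assertSubmittable(xml) {
  if (hasParserError(xml)) {
    throw new Error('Record is not valid XML; not submitting.');
  }
  return xml;
}

// Abbreviated version of the payload from step 6.
const payload =
  '<text id="text"><parsererror xmlns="http://www.w3.org/1999/xhtml"></parsererror>' +
  '<formhub><uuid>dbfdf102b9e74a69abdcc8527a702275</uuid></formhub><text/></text>';

console.log(hasParserError(payload)); // true
```

Checking before the upload matters because, per step 8, the record is deleted from browser storage once the server answers with a 201.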

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

tinok commented, Jul 2, 2020 (1 reaction)

EE already strips similar things like rich text formatting or images that may have been copied

The browser does that, not EE.

That’s a good point. Would be great if the browser stripped the control characters out as well when pasting. I guess we’ve documented now why we have to strip them in EE? 🤷

tinok commented, Jun 24, 2020 (1 reaction)

Update: I tested it with a source PDF and was able to reproduce the issue without needing to generate invalid chars in the console. I copied the text into this file: invalid characters.txt

The issue stems from the fact that ligature letters (like the combined ff or fi characters) are copied as these ASCII control characters. Here is the same abstract as in the text file with the ligatures highlighted: [image]
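The stripping mentioned in the comments could look roughly like the sketch below. It is an illustration under assumptions, not the fix that was adopted: `stripInvalidXmlChars` is a hypothetical name, and the pattern removes everything outside the XML 1.0 `Char` production.

```javascript
// Global, Unicode-aware pattern for characters XML 1.0 does not allow.
const INVALID_XML_CHARS = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

// Hypothetical helper: removes characters that would make the record
// unparseable, e.g. before a pasted value is stored in the model.
function stripInvalidXmlChars(value) {
  return value.replace(INVALID_XML_CHARS, '');
}

const pasted = 'what' + String.fromCharCode(4) + 'the' + String.fromCharCode(28) + 'hey';
console.log(stripInvalidXmlChars(pasted)); // whatthehey
```

Since the control characters stand in for ligatures in the PDF case, stripping them keeps the record valid but also drops the letters those ligatures represented.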
