Strings are not conforming to the RSS spec for valid chars.
The RSS spec specifies exactly which characters are considered valid in RSS:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
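To make the ranges above concrete, here is a minimal sketch of a check for them. The helper names `isAllowedCodePoint` and `findFirstDisallowed` are hypothetical and not part of node-rss or Ghost; the sketch only mirrors the ranges quoted above.

```javascript
// Hypothetical helper (not part of node-rss): true if the code point
// falls inside one of the ranges listed in the spec excerpt above.
function isAllowedCodePoint(cp) {
  return (
    cp === 0x9 || cp === 0xA || cp === 0xD ||
    (cp >= 0x20 && cp <= 0xD7FF) ||
    (cp >= 0xE000 && cp <= 0xFFFD) ||
    (cp >= 0x10000 && cp <= 0x10FFFF)
  );
}

// Hypothetical helper: report the first disallowed character in a string,
// or null if every character is allowed.
function findFirstDisallowed(str) {
  let index = 0;
  // for...of walks the string by code point, so a valid surrogate pair
  // is seen as one astral character; a lone surrogate is seen on its own.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (!isAllowedCodePoint(cp)) {
      return { index, codePoint: cp };
    }
    index += ch.length; // advance by UTF-16 code units
  }
  return null;
}
```

A check like this could be handy for logging which field of a feed item contains an invisible character before deciding how to clean it up.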
At present, this library doesn’t ensure that the strings it outputs conform to the spec, so the RSS feeds it generates can easily become broken. We’re having this problem in Ghost when users copy & paste data from elsewhere: things like form feeds and other control characters are completely invisible, but they make the RSS feed invalid & unusable.
There is some interesting information around about fixing this sort of problem:
http://stackoverflow.com/questions/397250/unicode-regex-invalid-xml-characters
http://stackoverflow.com/questions/2670037/how-to-remove-invalid-utf-8-characters-from-a-javascript-string
And here’s an example regex that I have been trying out for fixing the issue:
/(?![\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD])./g
Here it is in action:
https://regex101.com/r/pQ7aB6/1
I have a branch with this implemented in Ghost, and it seems to work ok: https://github.com/ErisDS/Ghost/commit/7acb3f9df3e7f2cec54eae8173de6a3947bfaaf8
The only open question is whether the regex is too naive, slow, or memory-intensive for use in a library like node-rss.
I’d be happy to PR a fix to node-rss, but I’m interested in feedback on the regex and on whether a different approach might be better.
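One point on the "too naive" question: the regex above has no `u` flag, so it sees characters in the #x10000-#x10FFFF range (emoji, for example) as two surrogate code units, neither of which is in the allowed class, and strips them. Below is a sketch of a variant that keeps them. It assumes a runtime with ES2015 `u`-flag support, and the names `INVALID_XML_CHARS` and `stripInvalidChars` are hypothetical, not taken from node-rss or the Ghost branch.

```javascript
// Assumption: the runtime supports the ES2015 `u` flag, so the class can
// name astral code points directly and surrogate pairs are read as one char.
// Anything NOT in the allowed ranges matches, including lone surrogates.
const INVALID_XML_CHARS =
  /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

// Hypothetical helper: remove every character outside the allowed ranges.
function stripInvalidChars(value) {
  return String(value).replace(INVALID_XML_CHARS, '');
}

// Usage: the form feed is dropped, the tab and the emoji are kept.
const cleaned = stripInvalidChars('Pasted\f title\t\u{1F600}');
console.log(JSON.stringify(cleaned));
```

A negated character class avoids running the negative lookahead at every position, though whether that matters for speed would need measuring on real feed data.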
Top GitHub Comments
Would be great to get some feedback on this, and see if we could move it forward.
That’s all well and good - but a unit test is a self-fulfilling prophecy, it only makes sense when you’re certain the concept is correct, which I am not 😉