Library throws "URI malformed" error when creating patch with emojis
The following code:
const DiffMatchPatch = require('diff-match-patch');
const dmp = new DiffMatchPatch();
const patchText = dmp.patch_toText(dmp.patch_make('', '👨🦰 👨🏿🦰 👨🦱 👨🏿🦱 🦹🏿♂️'));
const patchObj = dmp.patch_fromText(patchText);
const [patchedText] = dmp.patch_apply(patchObj, '');
dmp.patch_toText(dmp.patch_make(patchedText, '👾 🙇 💁 🙅 🙆 🙋 🙎 🙍'));
This throws a “URI malformed” error at this line. That’s often the problem when using encodeURI on arbitrary data (the md5 package has the same problem), but in this case, as far as I can see, the inputs are valid UTF-8.
I think either patch_make or patch_apply generates invalid text.
But also I’m wondering why is encodeURI needed in this lib? Wouldn’t a simple escape/unescape of specific reserved characters be enough?
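For context, here is a minimal sketch of how encodeURI can fail even when the original input is valid: it throws a URIError on a lone surrogate, which is what remains when an emoji stored as a surrogate pair is cut in half. The helper name tryEncode is made up for this illustration and is not part of diff-match-patch.

```javascript
// '👾' is U+1F47E, stored in a JavaScript string as two UTF-16
// code units: a high surrogate followed by a low surrogate.
const emoji = '👾';

// Hypothetical helper for this demo: returns the encoded text,
// or the name of the error if encoding fails.
function tryEncode(text) {
  try {
    return encodeURI(text);
  } catch (err) {
    return err.name;
  }
}

// The complete pair encodes to percent-escaped UTF-8 bytes.
const whole = tryEncode(emoji);

// Slicing off one code unit leaves a lone high surrogate,
// which encodeURI rejects with a URIError.
const half = tryEncode(emoji.slice(0, 1));

console.log(whole, half);
```

If a chunk of text inside the library ends between the two halves of such a pair, the same error would surface from the encodeURI call even though the caller passed well-formed strings.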
I got somewhere. Take it with a grain of salt though, as I might have easily missed something 😃
The problem appears when multi-character Unicode emojis get broken up into separate diffs inside a patch.

This mostly happens with internally generated prefixes/suffixes, for example inside patch_addContext_, where dmp “allocates” a chunk which starts or ends in the middle of an emoji. But it can occur inside an actual diff, too.

It seems to me that the problem can be solved by replacing encodeURI/decodeURI with escape/unescape (or another preferred escape method) inside patch_fromText and toString. Have you tried this? It seems to work fine for my use case, and I hope it doesn’t break something else. encodeURI feels quite out of place anyway, since the code deals with abstract text, not URIs.

Before I tried this, I also tinkered with the source code for quite a while, and I did seemingly manage to prevent such emoji splitting in patch_addContext_ by testing whether encodeURI throws, and increasing the padding and shifting the prefix end/suffix start until it didn’t. But this really is more of a desperate hack than anything else. Such an approach may be able to fix emojis breaking up, but dmp would still throw as soon as an invalid character appeared for any other reason (other characters can throw too).

Thanks for looking into it @michal-kurz. It seems unlikely that any change will be merged to the official repository (the PR is from 2019). Do you know if there’s a good fork being maintained somewhere where that kind of fix could be applied?
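The boundary-shifting workaround mentioned above (moving the prefix end or suffix start so it does not land inside an emoji) could be sketched roughly as follows. This is only an illustration under assumptions: the helpers splitsSurrogatePair, safeStart and safeEnd are hypothetical names, not diff-match-patch functions, and they only guard against cutting a surrogate pair in half.

```javascript
// Hypothetical helper: true when cutting `text` at `index` would
// separate a high surrogate (0xD800-0xDBFF) from the low surrogate
// (0xDC00-0xDFFF) that follows it.
function splitsSurrogatePair(text, index) {
  if (index <= 0 || index >= text.length) return false;
  const before = text.charCodeAt(index - 1);
  const after = text.charCodeAt(index);
  return before >= 0xD800 && before <= 0xDBFF &&
         after >= 0xDC00 && after <= 0xDFFF;
}

// Move a chunk start one code unit left so it includes the whole pair.
function safeStart(text, index) {
  return splitsSurrogatePair(text, index) ? index - 1 : index;
}

// Move a chunk end one code unit right so it includes the whole pair.
function safeEnd(text, index) {
  return splitsSurrogatePair(text, index) ? index + 1 : index;
}

// Example: 'a👾b' has length 4; index 2 falls inside the emoji.
const sample = 'a👾b';
const prefix = sample.substring(0, safeEnd(sample, 2));
const suffix = sample.substring(safeStart(sample, 2));
console.log(encodeURI(prefix), encodeURI(suffix));
```

As noted in the comment above, this kind of adjustment would not help if the text already contains a lone surrogate for some other reason, so it is a partial mitigation at best.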