Library throws "URI malformed" error when creating patch with emojis

The following code:

const DiffMatchPatch = require('diff-match-patch');
const dmp = new DiffMatchPatch();

const patchText = dmp.patch_toText(dmp.patch_make('', '👨‍🦰 👨🏿‍🦰 👨‍🦱 👨🏿‍🦱 🦹🏿‍♂️'));
const patchObj = dmp.patch_fromText(patchText);
const [patchedText] = dmp.patch_apply(patchObj, '');
dmp.patch_toText(dmp.patch_make(patchedText, '👾 🙇 💁 🙅 🙆 🙋 🙎 🙍'));

will throw a “URI malformed” error at this line. That’s often the problem when using encodeURI on arbitrary data (the md5 package has the same problem), but in this case, as far as I can see, the inputs are valid UTF-8.

I think either patch_make or patch_apply generates invalid text.
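
One way to test that hypothesis is to scan a string for unpaired UTF-16 surrogates before it is encoded. The sketch below is only an illustration: findLoneSurrogate is a hypothetical helper, not part of diff-match-patch.

```javascript
// Hypothetical helper: returns the index of the first unpaired UTF-16
// surrogate in `str`, or -1 if every surrogate is part of a valid pair.
function findLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: it must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return i;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return i;
    }
  }
  return -1;
}

console.log(findLoneSurrogate('ok \ud83d\udc9b')); // valid pair
console.log(findLoneSurrogate('cut \ud83d'));      // trailing high surrogate
```

Running it on patchedText from the snippet above should show whether the string handed to the second patch_make call already contains a stray surrogate half.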

I’m also wondering why encodeURI is needed in this lib. Wouldn’t a simple escape/unescape of specific reserved characters be enough?

Issue Analytics

Created 2 years ago
Comments: 7

Top GitHub Comments

michal-kurz commented, Oct 1, 2022

I got somewhere. Take it with a grain of salt though, as I might have easily missed something 😃

The problem appears when emojis that occupy more than one UTF-16 code unit (surrogate pairs) get broken up into separate diffs inside a patch:

'💛'.length    // 2
'💛'.charAt(0) // \ud83d
'💛'.charAt(1) // \udc9b

encodeURI("\ud83d")  // malformed uri error
encodeURI("\udc9b")  // malformed uri error
encodeURI("\uD83D\udc9b")  // '%F0%9F%92%9B'

This mostly happens with internally generated prefixes/suffixes, for example inside patch_addContext_ - dmp “allocates” a chunk which starts/ends in the middle of an emoji. But it can occur inside an actual diff, too.
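
To make the mechanism concrete, here is a minimal sketch of what happens when a chunk boundary is computed in UTF-16 code units and lands in the middle of an emoji. It does not use diff-match-patch itself; tryEncode and the cut index are made up for the illustration.

```javascript
// 'ab' + U+1F49B (yellow heart, written as its surrogate pair) + 'cd'
const text = 'ab\ud83d\udc9bcd';

// Index 3 falls between the high and the low surrogate.
const cut = 3;
const prefix = text.substring(0, cut); // ends with a lone high surrogate
const suffix = text.substring(cut);    // starts with a lone low surrogate

// Hypothetical helper: returns the encoded string, or the error name
// when encodeURI rejects the input.
function tryEncode(s) {
  try {
    return encodeURI(s);
  } catch (err) {
    if (err instanceof URIError) return err.name;
    throw err;
  }
}

console.log(tryEncode(text));   // encodes fine as a whole
console.log(tryEncode(prefix)); // fails: lone high surrogate
console.log(tryEncode(suffix)); // fails: lone low surrogate
```

Both halves are rejected even though the full string encodes without trouble, which matches the behavior described above.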

It seems to me that the problem can be solved by replacing encodeURI/decodeURI with escape/unescape (or another preferred escape method) inside patch_fromText and toString. Have you tried this? It seems to work fine for my use case, though I hope it doesn’t break something else. encodeURI feels quite out of place anyway, since the code deals with abstract text, not URIs.
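
A rough sketch of why that swap avoids the error: escape works on individual UTF-16 code units, so a lone surrogate is encoded as %uXXXX instead of throwing, and unescape restores it. The encodeChunk/decodeChunk names are placeholders for this example, not the library's functions, and escape/unescape are legacy (Annex B) functions, so treat this as an illustration rather than a recommended fix.

```javascript
// Placeholder wrappers standing in for whatever the patch text
// serializer would call; they are not diff-match-patch APIs.
const encodeChunk = (s) => escape(s);
const decodeChunk = (s) => unescape(s);

// The two halves of U+1F49B, as they might end up in separate diffs.
const halves = ['\ud83d', '\udc9b'];

// Each half can be encoded on its own without an exception.
const encodedHalves = halves.map(encodeChunk);
console.log(encodedHalves);

// Decoding and rejoining gives back the original emoji.
const roundTripped = encodedHalves.map(decodeChunk).join('');
console.log(roundTripped === '\ud83d\udc9b');
```

Note that text serialized this way is not interchangeable with text produced by the encodeURI-based format, so both sides of an exchange would need the same change.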

Before I tried this, I also tinkered with the source code for quite a while, and I did seemingly manage to prevent such emoji splitting in patch_addContext_ by testing whether encodeURI throws, then increasing the padding and shifting the prefix end/suffix start until it didn’t. But this really is more of a desperate hack than anything else. Such an approach may be able to stop emojis from breaking up, but dmp would still throw as soon as an invalid character appeared for any other reason (other characters can throw too).
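
For reference, the hack described above could look roughly like the following. This is a standalone reconstruction of the idea, not the actual change made to patch_addContext_; isEncodable and widenToEncodable are hypothetical names.

```javascript
// Hypothetical helper: true if encodeURI accepts the string.
function isEncodable(s) {
  try {
    encodeURI(s);
    return true;
  } catch (err) {
    if (err instanceof URIError) return false;
    throw err;
  }
}

// Hypothetical helper: grow the slice [start, end) of `text` outward,
// one code unit at a time, until encodeURI no longer throws on it.
function widenToEncodable(text, start, end) {
  let s = start;
  let e = end;
  while (!isEncodable(text.substring(s, e)) && (s > 0 || e < text.length)) {
    if (e < text.length && isEncodable(text.substring(s, e + 1))) {
      e += 1; // pulling in the next code unit completes a pair
    } else if (s > 0) {
      s -= 1; // otherwise try pulling in the previous code unit
    } else {
      e += 1;
    }
  }
  return [s, e];
}

const sample = 'a\ud83d\udc9bb'; // 'a' + U+1F49B + 'b'
console.log(widenToEncodable(sample, 0, 2)); // slice ended mid-emoji
console.log(widenToEncodable(sample, 2, 4)); // slice started mid-emoji
```

As noted, this only papers over split pairs: if the text itself contains an unpaired surrogate, the loop runs out of room and the slice still fails to encode.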

laurent22 commented, Oct 5, 2022

Thanks for looking into it @michal-kurz. It seems unlikely that any change will be merged to the official repository (the PR is from 2019) - do you know if there’s a good fork being maintained somewhere where that kind of fix could be applied?
