Normalize characters with combining marks to precomposed characters

I ran into a slightly odd problem that I wanted to solve:

I have a PDF file with German umlauts (üöäÜÖÄ). If I copy and paste them into the TinyMCE editor in WordPress, I get the base vowel (uoaUOA) followed by a combining diaeresis (U+0308, http://www.fileformat.info/info/unicode/char/0308/index.htm) instead of a single precomposed character.
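
To see what actually arrived in the editor, it can help to list the code points of the pasted text. Below is a minimal JavaScript sketch; the helper name `describeCodePoints` is made up for illustration and is not part of TinyMCE or WordPress.

```javascript
// Hypothetical helper: list the Unicode code points in a string.
function describeCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const precomposed = "\u00E4"; // "ä" as a single code point
const decomposed = "a\u0308"; // "a" followed by COMBINING DIAERESIS

console.log(describeCodePoints(precomposed)); // [ 'U+00E4' ]
console.log(describeCodePoints(decomposed));  // [ 'U+0061', 'U+0308' ]
```

Both strings render as the same glyph, which is why the difference is easy to miss until something compares them.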

This results in some problems:

  • Searching for words with umlauts doesn't work
  • Proofreading fails
  • W3C validation fails with the warning "Text run is not in Unicode Normalization Form C." because precomposed characters are preferred (see: http://www.w3.org/International/docs/charmod-norm/#choice-of-normalization-form)
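
The search and validation problems can be reproduced in a few lines of JavaScript. This is only a sketch: `isNFC` is a hypothetical helper, and the sample word is an assumption chosen for illustration.

```javascript
// Hypothetical check: a string is in NFC if normalizing leaves it unchanged.
function isNFC(text) {
  return text === text.normalize("NFC");
}

const pasted = "Mu\u0308ller"; // decomposed, as pasted from the PDF
const typed = "M\u00FCller";   // precomposed, as typed into a search box

console.log(pasted === typed);                  // false
console.log(pasted.includes("\u00FC"));         // false
console.log(isNFC(pasted));                     // false
console.log(isNFC(typed));                      // true
console.log(pasted.normalize("NFC") === typed); // true
```

The two strings look identical on screen but compare as different, so a plain substring search for the precomposed "ü" never matches the pasted text.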

Solution: I made a proof-of-concept with the “content_save_pre” filter in WordPress and it works. In this proof-of-concept I just replaced each two-character sequence (vowel plus combining diaeresis) with the precomposed character:

$content = str_replace( "a\xCC\x88", "ä", $content );
$content = str_replace( "o\xCC\x88", "ö", $content );
$content = str_replace( "u\xCC\x88", "ü", $content );
$content = str_replace( "A\xCC\x88", "Ä", $content );
$content = str_replace( "O\xCC\x88", "Ö", $content );
$content = str_replace( "U\xCC\x88", "Ü", $content );

If we could rely on PHP 5.3+ (I know we can’t, because WP still supports PHP 5.2), we could use a built-in function for that: http://php.net/manual/de/normalizer.normalize.php

The code above would then shrink to one line and be much more general: $content = normalizer_normalize( $content, Normalizer::FORM_C );
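
To illustrate why full normalization is more general than a fixed replacement table, here is a JavaScript sketch that compares the two approaches. The names `UMLAUTS` and `replaceUmlauts` and the sample text are assumptions for illustration; only `String.prototype.normalize` is a standard API.

```javascript
// Table-based approach, mirroring the six str_replace() calls.
const UMLAUTS = {
  "a\u0308": "\u00E4", "o\u0308": "\u00F6", "u\u0308": "\u00FC",
  "A\u0308": "\u00C4", "O\u0308": "\u00D6", "U\u0308": "\u00DC",
};

function replaceUmlauts(text) {
  return text.replace(/[aouAOU]\u0308/g, (pair) => UMLAUTS[pair]);
}

const sample = "Cafe\u0301 Mu\u0308ller"; // "é" and "ü" are both decomposed

// The table only knows the six German umlauts, so "é" stays decomposed.
console.log(replaceUmlauts(sample) === "Cafe\u0301 M\u00FCller"); // true

// NFC normalization composes every sequence that has a precomposed form.
console.log(sample.normalize("NFC") === "Caf\u00E9 M\u00FCller"); // true
```

The table approach works for the German case described here, but any other accented letter pasted in decomposed form would slip through.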

Fun fact: for me the problem only occurs on Mac OS X (10.7.5 Lion and 10.9.5). I couldn’t reproduce it on Ubuntu 14.04 or Windows 7.

Description of problem: Pasting text that contains non-precomposed (decomposed) characters

Steps to reproduce:

  1. Paste text from the attached PDF (the attachment didn’t work because the file extension is not allowed 😦)

Expected result: Precomposed characters

Actual result: vowels plus diaeresis (vowel plus combining character)

I submitted a bug report (including a patch) for WordPress, but maybe this can be solved better in TinyMCE.

Legacy information imported from TinyMCE bug tracker:

#T7243 posted by zodiac1978

Tags: [firefox msie safari chrome] Status: Open Resolution: None Attached URL: none

Issue Analytics

  • State: open
  • Created 9 years ago
  • Comments: 9 (1 by maintainers)

Top GitHub Comments

Zodiac1978 commented, Dec 28, 2015

With ES6 we have a normalize function in JS: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

And there is a polyfill for older browsers: https://github.com/walling/unorm
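
A client-side fix along these lines might look like the sketch below. It is not an implementation from TinyMCE or WordPress: `toNFC` and `normalizePastedContent` are hypothetical names, the `fallback` parameter stands in for a polyfill's NFC function, and the `{ content }` argument shape is an assumption about what a paste-preprocessing hook would pass. Which hook to attach it to should be checked against the TinyMCE documentation for the version in use.

```javascript
// Normalize text to NFC, using String.prototype.normalize where available.
// `fallback` is a hypothetical parameter for environments without it,
// e.g. an NFC function supplied by a polyfill library.
function toNFC(text, fallback) {
  if (typeof text.normalize === "function") {
    return text.normalize("NFC");
  }
  return typeof fallback === "function" ? fallback(text) : text;
}

// Hypothetical paste hook: receives an object with a `content` string
// and normalizes it in place before it reaches the editor.
function normalizePastedContent(args, fallback) {
  args.content = toNFC(args.content, fallback);
  return args;
}

const args = { content: "<p>Mu\u0308ller</p>" };
normalizePastedContent(args);
console.log(args.content === "<p>M\u00FCller</p>"); // true
```

Normalizing at paste time keeps decomposed sequences out of the stored content, so search, proofreading, and validation all see the precomposed form.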

Zodiac1978 commented, Aug 20, 2020

This is still a valid issue. Please re-open.
